From 1a5a9906d4e8d1976b701f889d8f35d54b928f25 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Wed, 21 Mar 2012 16:33:42 -0700 Subject: mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(¤t->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell Signed-off-by: Andrea Arcangeli Acked-by: Johannes Weiner Cc: Mel Gorman Cc: Hugh Dickins Cc: Dave Jones Acked-by: Larry Woodman Acked-by: Rik van Riel Cc: [2.6.38+] Cc: Mark Salter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index b466cab..328cb37 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -172,6 +172,7 @@ static void mark_screen_rdonly(struct mm_struct *mm) spinlock_t *ptl; int i; + down_write(&mm->mmap_sem); pgd = pgd_offset(mm, 0xA0000); if (pgd_none_or_clear_bad(pgd)) goto out; @@ -190,6 +191,7 @@ static void mark_screen_rdonly(struct mm_struct *mm) } pte_unmap_unlock(pte, ptl); out: + up_write(&mm->mmap_sem); flush_tlb(); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7dcd2a2..3efa725 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -409,6 +409,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } else { spin_unlock(&walk->mm->page_table_lock); } + + if (pmd_trans_unstable(pmd)) + return 0; /* * The mmap_sem held all the way back in m_start() is what * keeps khugepaged out of here and from collapsing things @@ -507,6 +510,8 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, struct page *page; split_huge_page_pmd(walk->mm, pmd); + if (pmd_trans_unstable(pmd)) + return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) { @@ -670,6 +675,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, int err = 0; split_huge_page_pmd(walk->mm, pmd); + if (pmd_trans_unstable(pmd)) + return 0; /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); @@ -961,6 +968,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, spin_unlock(&walk->mm->page_table_lock); } + if (pmd_trans_unstable(pmd)) + return 0; orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { struct page *page = can_gather_numa_stats(*pte, md->vma, addr); diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 76bff2b..a03c098 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -425,6 +425,8 @@ extern void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn, unsigned long size); #endif +#ifdef CONFIG_MMU + #ifndef CONFIG_TRANSPARENT_HUGEPAGE static inline int pmd_trans_huge(pmd_t pmd) { @@ -441,7 +443,66 @@ static inline int pmd_write(pmd_t pmd) return 0; } #endif /* __HAVE_ARCH_PMD_WRITE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + +/* + * This function is meant to be used by sites walking pagetables with + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd + * into a null pmd and the transhuge page fault can convert a null pmd + * into an hugepmd or into a regular pmd (if the hugepage allocation + * fails). While holding the mmap_sem in read mode the pmd becomes + * stable and stops changing under us only if it's not null and not a + * transhuge pmd. When those races occurs and this function makes a + * difference vs the standard pmd_none_or_clear_bad, the result is + * undefined so behaving like if the pmd was none is safe (because it + * can return none anyway). The compiler level barrier() is critically + * important to compute the two checks atomically on the same pmdval. + */ +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) +{ + /* depend on compiler for an atomic pmd read */ + pmd_t pmdval = *pmd; + /* + * The barrier will stabilize the pmdval in a register or on + * the stack so that it will stop changing under the code. + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + barrier(); +#endif + if (pmd_none(pmdval)) + return 1; + if (unlikely(pmd_bad(pmdval))) { + if (!pmd_trans_huge(pmdval)) + pmd_clear_bad(pmd); + return 1; + } + return 0; +} + +/* + * This is a noop if Transparent Hugepage Support is not built into + * the kernel. Otherwise it is equivalent to + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in + * places that already verified the pmd is not none and they want to + * walk ptes while holding the mmap sem in read mode (write mode don't + * need this). If THP is not enabled, the pmd can't go away under the + * code even if MADV_DONTNEED runs, but if THP is enabled we need to + * run a pmd_trans_unstable before walking the ptes after + * split_huge_page_pmd returns (because it may have run when the pmd + * become null, but then a page fault can map in a THP and not a + * regular page). + */ +static inline int pmd_trans_unstable(pmd_t *pmd) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + return pmd_none_or_trans_huge_or_clear_bad(pmd); +#else + return 0; #endif +} + +#endif /* CONFIG_MMU */ #endif /* !__ASSEMBLY__ */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 26c6f4e..3728181 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5230,6 +5230,8 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, spinlock_t *ptl; split_huge_page_pmd(walk->mm, pmd); + if (pmd_trans_unstable(pmd)) + return 0; pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) @@ -5390,6 +5392,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, spinlock_t *ptl; split_huge_page_pmd(walk->mm, pmd); + if (pmd_trans_unstable(pmd)) + return 0; retry: pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { diff --git a/mm/memory.c b/mm/memory.c index 347e5fa..e01abb9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1247,16 +1247,24 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, do { next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) { - if (next-addr != HPAGE_PMD_SIZE) { + if (next - addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) - continue; + goto next; /* fall through */ } - if (pmd_none_or_clear_bad(pmd)) - continue; + /* + * Here there can be other concurrent MADV_DONTNEED or + * trans huge page faults running, and if the pmd is + * none or trans huge it can change under us. This is + * because MADV_DONTNEED holds the mmap_sem in read + * mode. + */ + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + goto next; next = zap_pte_range(tlb, vma, pmd, addr, next, details); +next: cond_resched(); } while (pmd++, addr = next, addr != end); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 47296fe..0a37570 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -512,7 +512,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud, do { next = pmd_addr_end(addr, end); split_huge_page_pmd(vma->vm_mm, pmd); - if (pmd_none_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; if (check_pte_range(vma, pmd, addr, next, nodes, flags, private)) diff --git a/mm/mincore.c b/mm/mincore.c index 636a868..936b4ce 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -164,7 +164,7 @@ static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud, } /* fall through */ } - if (pmd_none_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) mincore_unmapped_range(vma, addr, next, vec); else mincore_pte_range(vma, pmd, addr, next, vec); diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 2f5cf10..aa9701e 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -59,7 +59,7 @@ again: continue; split_huge_page_pmd(walk->mm, pmd); - if (pmd_none_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) goto again; err = walk_pte_range(pmd, addr, next, walk); if (err) diff --git a/mm/swapfile.c b/mm/swapfile.c index 00a962c..44595a3 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -932,9 +932,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - if (unlikely(pmd_trans_huge(*pmd))) - continue; - if (pmd_none_or_clear_bad(pmd)) + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) continue; ret = unuse_pte_range(vma, pmd, addr, next, entry, page); if (ret) -- cgit v0.10.2 From 1de5b41cd3b2474c2770b825266d372073e1b28b Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Wed, 21 Mar 2012 16:33:42 -0700 Subject: fs/namei.c: fix warnings on 32-bit i386 allnoconfig: fs/namei.c: In function 'has_zero': fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c: In function 'hash_name': fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type There must be a tidier way of doing this. Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/namei.c b/fs/namei.c index 20a4fcf..561db47 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1455,9 +1455,15 @@ done: } EXPORT_SYMBOL(full_name_hash); +#ifdef CONFIG_64BIT #define ONEBYTES 0x0101010101010101ul #define SLASHBYTES 0x2f2f2f2f2f2f2f2ful #define HIGHBITS 0x8080808080808080ul +#else +#define ONEBYTES 0x01010101ul +#define SLASHBYTES 0x2f2f2f2ful +#define HIGHBITS 0x80808080ul +#endif /* Return the high bit set in the first byte that is a zero */ static inline unsigned long has_zero(unsigned long a) -- cgit v0.10.2 From dc716e96f5a467835e8121e1caaf25d66a901cb3 Mon Sep 17 00:00:00 2001 From: Marcos Paulo de Souza Date: Wed, 21 Mar 2012 16:33:43 -0700 Subject: drivers/idle/intel_idle.c: fix confusing code identation Fix a code indentation in the function intel_idle_cpu_init that looks confusing.o Suggested-by: Srivatsa S. Bhat Reviewed-by: Srivatsa S. Bhat Signed-off-by: Marcos Paulo de Souza Cc: "Brown, Len" Cc: Len Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 1c15e9b..d0f59c3 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -507,8 +507,7 @@ int intel_idle_cpu_init(int cpu) int num_substates; if (cstate > max_cstate) { - printk(PREFIX "max_cstate %d reached\n", - max_cstate); + printk(PREFIX "max_cstate %d reached\n", max_cstate); break; } @@ -524,8 +523,9 @@ int intel_idle_cpu_init(int cpu) dev->states_usage[dev->state_count].driver_data = (void *)get_driver_data(cstate); - dev->state_count += 1; - } + dev->state_count += 1; + } + dev->cpu = cpu; if (cpuidle_register_device(dev)) { -- cgit v0.10.2 From 7904ac84244b59f536c2a5d1066a10f46df07b08 Mon Sep 17 00:00:00 2001 From: Earl Chew Date: Wed, 21 Mar 2012 16:33:43 -0700 Subject: seq_file: fix mishandling of consecutive pread() invocations. The following program illustrates the problem: char buf[8192]; int fd = open("/proc/self/maps", O_RDONLY); n = pread(fd, buf, sizeof(buf), 0); printf("%d\n", n); /* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */ n = pread(fd, buf, sizeof(buf), 0); printf("%d\n", n); The second printf() prints zero, but uncommenting the lseek() corrects its behaviour. To fix, make seq_read() mirror seq_lseek() when processing changes in *ppos. Restore m->version first, then if required traverse and update read_pos on success. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856 Signed-off-by: Earl Chew Cc: Alexey Dobriyan Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/seq_file.c b/fs/seq_file.c index 4023d6b..aa242dc 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -140,9 +140,21 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) mutex_lock(&m->lock); + /* + * seq_file->op->..m_start/m_stop/m_next may do special actions + * or optimisations based on the file->f_version, so we want to + * pass the file->f_version to those methods. + * + * seq_file->version is just copy of f_version, and seq_file + * methods can treat it simply as file version. + * It is copied in first and copied out after all operations. + * It is convenient to have it as part of structure to avoid the + * need of passing another argument to all the seq_file methods. + */ + m->version = file->f_version; + /* Don't assume *ppos is where we left it */ if (unlikely(*ppos != m->read_pos)) { - m->read_pos = *ppos; while ((err = traverse(m, *ppos)) == -EAGAIN) ; if (err) { @@ -152,21 +164,11 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) m->index = 0; m->count = 0; goto Done; + } else { + m->read_pos = *ppos; } } - /* - * seq_file->op->..m_start/m_stop/m_next may do special actions - * or optimisations based on the file->f_version, so we want to - * pass the file->f_version to those methods. - * - * seq_file->version is just copy of f_version, and seq_file - * methods can treat it simply as file version. - * It is copied in first and copied out after all operations. - * It is convenient to have it as part of structure to avoid the - * need of passing another argument to all the seq_file methods. - */ - m->version = file->f_version; /* grab buffer if we didn't have one */ if (!m->buf) { m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); -- cgit v0.10.2 From fa47ac59020e91082386f65a01f3e8cc6116ef95 Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 21 Mar 2012 16:33:44 -0700 Subject: xtensa: don't reimplement force_sigsegv() Instead of open coding the sequence from force_sigsegv() just call it. This also fixes a bug because we were modifying ka->sa.sa_handler (which is a copy of sighand->action[]), whereas the intention of the code was to modify sighand->action[] directly. As the original code was working with a copy it had no effect on signal delivery. Acked-by: Oleg Nesterov Cc: Chris Zankel Signed-off-by: Matt Fleming Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index f2220b5..4f53770 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -425,9 +425,7 @@ static void setup_frame(int sig, struct k_sigaction *ka, siginfo_t *info, return; give_sigsegv: - if (sig == SIGSEGV) - ka->sa.sa_handler = SIG_DFL; - force_sig(SIGSEGV, current); + force_sigsegv(sig, current); } /* -- cgit v0.10.2 From ff6d21e7aafe3cf4b20697f67e656caa4daef40b Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 21 Mar 2012 16:33:44 -0700 Subject: xtensa: no need to reset handler if SA_ONESHOT get_signal_to_deliver() already resets the signal handler if SA_ONESHOT is set in ka->sa.sa_flags, there's no need to do it again in handle_signal(). Furthermore, because we were modifying ka->sa.sa_handler (which is a copy of sighand->action[]) instead of sighand->action[] the original code actually had no effect on signal delivery. Acked-by: Oleg Nesterov Cc: Chris Zankel Signed-off-by: Matt Fleming Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 4f53770..24655e3 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -536,9 +536,6 @@ int do_signal(struct pt_regs *regs, sigset_t *oldset) /* Set up the stack frame */ setup_frame(signr, &ka, &info, oldset, regs); - if (ka.sa.sa_flags & SA_ONESHOT) - ka.sa.sa_handler = SIG_DFL; - spin_lock_irq(¤t->sighand->siglock); sigorsets(¤t->blocked, ¤t->blocked, &ka.sa.sa_mask); if (!(ka.sa.sa_flags & SA_NODEFER)) -- cgit v0.10.2 From 3785006ac3c8941feb63097c416de92114a6bc39 Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 21 Mar 2012 16:33:45 -0700 Subject: xtensa: don't mask signals if we fail to setup signal stack setup_frame() needs to return an indication of whether it succeeded or failed in setting up the signal stack frame. If setup_frame() fails then we must not modify current->blocked. Acked-by: Oleg Nesterov Cc: Chris Zankel Signed-off-by: Matt Fleming Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 24655e3..17ceab8 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -336,8 +336,8 @@ gen_return_code(unsigned char *codemem) } -static void setup_frame(int sig, struct k_sigaction *ka, siginfo_t *info, - sigset_t *set, struct pt_regs *regs) +static int setup_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + sigset_t *set, struct pt_regs *regs) { struct rt_sigframe *frame; int err = 0; @@ -422,10 +422,11 @@ static void setup_frame(int sig, struct k_sigaction *ka, siginfo_t *info, current->comm, current->pid, signal, frame, regs->pc); #endif - return; + return 0; give_sigsegv: force_sigsegv(sig, current); + return -EFAULT; } /* @@ -534,7 +535,9 @@ int do_signal(struct pt_regs *regs, sigset_t *oldset) /* Whee! Actually deliver the signal. */ /* Set up the stack frame */ - setup_frame(signr, &ka, &info, oldset, regs); + ret = setup_frame(signr, &ka, &info, oldset, regs); + if (ret) + return ret; spin_lock_irq(¤t->sighand->siglock); sigorsets(¤t->blocked, ¤t->blocked, &ka.sa.sa_mask); -- cgit v0.10.2 From d12f7c4a2f5d5c03ace0543d8cc70966c54d0692 Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 21 Mar 2012 16:33:45 -0700 Subject: xtensa: use set_current_blocked() and block_sigmask() As described in commit e6fa16ab9c1e ("signal: sigprocmask() should do retarget_shared_pending()") the modification of current->blocked is incorrect as we need to check whether the signal we're about to block is pending in the shared queue. Also, use the new helper function introduced in commit 5e6292c0f28f ("signal: add block_sigmask() for adding sigmask to current->blocked") which centralises the code for updating current->blocked after successfully delivering a signal and reduces the amount of duplicate code across architectures. In the past some architectures got this code wrong, so using this helper function should stop that from happening again. Acked-by: Oleg Nesterov Cc: Chris Zankel Signed-off-by: Matt Fleming Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c index 17ceab8..b69b000 100644 --- a/arch/xtensa/kernel/signal.c +++ b/arch/xtensa/kernel/signal.c @@ -260,10 +260,7 @@ asmlinkage long xtensa_rt_sigreturn(long a0, long a1, long a2, long a3, goto badframe; sigdelsetmask(&set, ~_BLOCKABLE); - spin_lock_irq(¤t->sighand->siglock); - current->blocked = set; - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); + set_current_blocked(&set); if (restore_sigcontext(regs, frame)) goto badframe; @@ -448,11 +445,8 @@ asmlinkage long xtensa_rt_sigsuspend(sigset_t __user *unewset, return -EFAULT; sigdelsetmask(&newset, ~_BLOCKABLE); - spin_lock_irq(¤t->sighand->siglock); saveset = current->blocked; - current->blocked = newset; - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); + set_current_blocked(&newset); regs->areg[2] = -EINTR; while (1) { @@ -539,12 +533,7 @@ int do_signal(struct pt_regs *regs, sigset_t *oldset) if (ret) return ret; - spin_lock_irq(¤t->sighand->siglock); - sigorsets(¤t->blocked, ¤t->blocked, &ka.sa.sa_mask); - if (!(ka.sa.sa_flags & SA_NODEFER)) - sigaddset(¤t->blocked, signr); - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); + block_sigmask(&ka, signr); if (current->ptrace & PT_SINGLESTEP) task_pt_regs(current)->icountlevel = 1; -- cgit v0.10.2 From ce24d8a14207c2036df86d2bd3d14b4393eb51e3 Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Wed, 21 Mar 2012 16:33:46 -0700 Subject: sparc: use block_sigmask() Use the new helper function introduced in commit 5e6292c0f28f ("signal: add block_sigmask() for adding sigmask to current->blocked") which centralises the code for updating current->blocked after successfully delivering a signal and reduces the amount of duplicate code across architectures. In the past some architectures got this code wrong, so using this helper function should stop that from happening again. Acked-by: Oleg Nesterov Acked-by: "David S. Miller" Signed-off-by: Matt Fleming Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c index 023b886..c8f5b50 100644 --- a/arch/sparc/kernel/signal32.c +++ b/arch/sparc/kernel/signal32.c @@ -776,7 +776,6 @@ static inline int handle_signal32(unsigned long signr, struct k_sigaction *ka, siginfo_t *info, sigset_t *oldset, struct pt_regs *regs) { - sigset_t blocked; int err; if (ka->sa.sa_flags & SA_SIGINFO) @@ -787,11 +786,7 @@ static inline int handle_signal32(unsigned long signr, struct k_sigaction *ka, if (err) return err; - sigorsets(&blocked, ¤t->blocked, &ka->sa.sa_mask); - if (!(ka->sa.sa_flags & SA_NOMASK)) - sigaddset(&blocked, signr); - set_current_blocked(&blocked); - + block_sigmask(ka, signr); tracehook_signal_handler(signr, info, ka, regs, 0); return 0; diff --git a/arch/sparc/kernel/signal_32.c b/arch/sparc/kernel/signal_32.c index d54c6e5..7bb71b6 100644 --- a/arch/sparc/kernel/signal_32.c +++ b/arch/sparc/kernel/signal_32.c @@ -465,7 +465,6 @@ static inline int handle_signal(unsigned long signr, struct k_sigaction *ka, siginfo_t *info, sigset_t *oldset, struct pt_regs *regs) { - sigset_t blocked; int err; if (ka->sa.sa_flags & SA_SIGINFO) @@ -476,11 +475,7 @@ handle_signal(unsigned long signr, struct k_sigaction *ka, if (err) return err; - sigorsets(&blocked, ¤t->blocked, &ka->sa.sa_mask); - if (!(ka->sa.sa_flags & SA_NOMASK)) - sigaddset(&blocked, signr); - set_current_blocked(&blocked); - + block_sigmask(ka, signr); tracehook_signal_handler(signr, info, ka, regs, 0); return 0; diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c index f0836cd..d8a67e6 100644 --- a/arch/sparc/kernel/signal_64.c +++ b/arch/sparc/kernel/signal_64.c @@ -479,18 +479,14 @@ static inline int handle_signal(unsigned long signr, struct k_sigaction *ka, siginfo_t *info, sigset_t *oldset, struct pt_regs *regs) { - sigset_t blocked; int err; err = setup_rt_frame(ka, regs, signr, oldset, (ka->sa.sa_flags & SA_SIGINFO) ? info : NULL); if (err) return err; - sigorsets(&blocked, ¤t->blocked, &ka->sa.sa_mask); - if (!(ka->sa.sa_flags & SA_NOMASK)) - sigaddset(&blocked, signr); - set_current_blocked(&blocked); + block_sigmask(ka, signr); tracehook_signal_handler(signr, info, ka, regs, 0); return 0; -- cgit v0.10.2 From 2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:33:46 -0700 Subject: mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes Cc: KOSAKI Motohiro Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 2958fd8..a26695f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -434,14 +434,14 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } #define K(x) ((x) << (PAGE_SHIFT-10)) -static int oom_kill_task(struct task_struct *p) +static void oom_kill_task(struct task_struct *p) { struct task_struct *q; struct mm_struct *mm; p = find_lock_task_mm(p); if (!p) - return 1; + return; /* mm cannot be safely dereferenced after task_unlock(p) */ mm = p->mm; @@ -477,15 +477,13 @@ static int oom_kill_task(struct task_struct *p) set_tsk_thread_flag(p, TIF_MEMDIE); force_sig(SIGKILL, p); - - return 0; } #undef K -static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, - unsigned int points, unsigned long totalpages, - struct mem_cgroup *memcg, nodemask_t *nodemask, - const char *message) +static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, + unsigned int points, unsigned long totalpages, + struct mem_cgroup *memcg, nodemask_t *nodemask, + const char *message) { struct task_struct *victim = p; struct task_struct *child; @@ -501,7 +499,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, */ if (p->flags & PF_EXITING) { set_tsk_thread_flag(p, TIF_MEMDIE); - return 0; + return; } task_lock(p); @@ -533,7 +531,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } while_each_thread(p, t); - return oom_kill_task(victim); + oom_kill_task(victim); } /* @@ -580,15 +578,10 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask) check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL); limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT; read_lock(&tasklist_lock); -retry: p = select_bad_process(&points, limit, memcg, NULL); - if (!p || PTR_ERR(p) == -1UL) - goto out; - - if (oom_kill_process(p, gfp_mask, 0, points, limit, memcg, NULL, - "Memory cgroup out of memory")) - goto retry; -out: + if (p && PTR_ERR(p) != -1UL) + oom_kill_process(p, gfp_mask, 0, points, limit, memcg, NULL, + "Memory cgroup out of memory"); read_unlock(&tasklist_lock); } #endif @@ -745,33 +738,24 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, if (sysctl_oom_kill_allocating_task && !oom_unkillable_task(current, NULL, nodemask) && current->mm) { - /* - * oom_kill_process() needs tasklist_lock held. If it returns - * non-zero, current could not be killed so we must fallback to - * the tasklist scan. - */ - if (!oom_kill_process(current, gfp_mask, order, 0, totalpages, - NULL, nodemask, - "Out of memory (oom_kill_allocating_task)")) - goto out; + oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL, + nodemask, + "Out of memory (oom_kill_allocating_task)"); + goto out; } -retry: p = select_bad_process(&points, totalpages, NULL, mpol_mask); - if (PTR_ERR(p) == -1UL) - goto out; - /* Found nothing?!?! Either we hang forever, or we panic. */ if (!p) { dump_header(NULL, gfp_mask, order, NULL, mpol_mask); read_unlock(&tasklist_lock); panic("Out of memory and no killable processes...\n"); } - - if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, - nodemask, "Out of memory")) - goto retry; - killed = 1; + if (PTR_ERR(p) != -1UL) { + oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, + nodemask, "Out of memory"); + killed = 1; + } out: read_unlock(&tasklist_lock); -- cgit v0.10.2 From 647f2bdf4a00dbcaa8964286501d68e7d2e6da93 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:33:46 -0700 Subject: mm, oom: fold oom_kill_task() into oom_kill_process() oom_kill_task() has a single caller, so fold it into its parent function, oom_kill_process(). Slightly reduces the number of lines in the oom killer. Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/oom_kill.c b/mm/oom_kill.c index a26695f..d402b2c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -434,52 +434,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } #define K(x) ((x) << (PAGE_SHIFT-10)) -static void oom_kill_task(struct task_struct *p) -{ - struct task_struct *q; - struct mm_struct *mm; - - p = find_lock_task_mm(p); - if (!p) - return; - - /* mm cannot be safely dereferenced after task_unlock(p) */ - mm = p->mm; - - pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", - task_pid_nr(p), p->comm, K(p->mm->total_vm), - K(get_mm_counter(p->mm, MM_ANONPAGES)), - K(get_mm_counter(p->mm, MM_FILEPAGES))); - task_unlock(p); - - /* - * Kill all user processes sharing p->mm in other thread groups, if any. - * They don't get access to memory reserves or a higher scheduler - * priority, though, to avoid depletion of all memory or task - * starvation. This prevents mm->mmap_sem livelock when an oom killed - * task cannot exit because it requires the semaphore and its contended - * by another thread trying to allocate memory itself. That thread will - * now get access to memory reserves since it has a pending fatal - * signal. - */ - for_each_process(q) - if (q->mm == mm && !same_thread_group(q, p) && - !(q->flags & PF_KTHREAD)) { - if (q->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) - continue; - - task_lock(q); /* Protect ->comm from prctl() */ - pr_err("Kill process %d (%s) sharing same memory\n", - task_pid_nr(q), q->comm); - task_unlock(q); - force_sig(SIGKILL, q); - } - - set_tsk_thread_flag(p, TIF_MEMDIE); - force_sig(SIGKILL, p); -} -#undef K - static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned int points, unsigned long totalpages, struct mem_cgroup *memcg, nodemask_t *nodemask, @@ -488,6 +442,7 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, struct task_struct *victim = p; struct task_struct *child; struct task_struct *t = p; + struct mm_struct *mm; unsigned int victim_points = 0; if (printk_ratelimit()) @@ -531,8 +486,44 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } } while_each_thread(p, t); - oom_kill_task(victim); + victim = find_lock_task_mm(victim); + if (!victim) + return; + + /* mm cannot safely be dereferenced after task_unlock(victim) */ + mm = victim->mm; + pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", + task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), + K(get_mm_counter(victim->mm, MM_ANONPAGES)), + K(get_mm_counter(victim->mm, MM_FILEPAGES))); + task_unlock(victim); + + /* + * Kill all user processes sharing victim->mm in other thread groups, if + * any. They don't get access to memory reserves, though, to avoid + * depletion of all memory. This prevents mm->mmap_sem livelock when an + * oom killed thread cannot exit because it requires the semaphore and + * its contended by another thread trying to allocate memory itself. + * That thread will now get access to memory reserves since it has a + * pending fatal signal. + */ + for_each_process(p) + if (p->mm == mm && !same_thread_group(p, victim) && + !(p->flags & PF_KTHREAD)) { + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + continue; + + task_lock(p); /* Protect ->comm from prctl() */ + pr_err("Kill process %d (%s) sharing same memory\n", + task_pid_nr(p), p->comm); + task_unlock(p); + force_sig(SIGKILL, p); + } + + set_tsk_thread_flag(victim, TIF_MEMDIE); + force_sig(SIGKILL, victim); } +#undef K /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. -- cgit v0.10.2 From 8447d950e7445cae71ad66d0e33784f8388aaf9d Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:33:47 -0700 Subject: mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro Acked-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d402b2c..8561060 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -445,9 +445,6 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, struct mm_struct *mm; unsigned int victim_points = 0; - if (printk_ratelimit()) - dump_header(p, gfp_mask, order, memcg, nodemask); - /* * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly @@ -457,6 +454,9 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return; } + if (printk_ratelimit()) + dump_header(p, gfp_mask, order, memcg, nodemask); + task_lock(p); pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", message, task_pid_nr(p), p->comm, points); -- cgit v0.10.2 From dc3f21eadeea6d9898271ff32d35d5e00c6872ea Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:33:47 -0700 Subject: mm, oom: introduce independent oom killer ratelimit state printk_ratelimit() uses the global ratelimit state for all printks. The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log. This patch introduces printk ratelimiting specifically for the oom killer. Signed-off-by: David Rientjes Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 8561060..517299c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -34,6 +34,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -444,6 +445,8 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, struct task_struct *t = p; struct mm_struct *mm; unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); /* * If the task is already exiting, don't alarm the sysadmin or kill @@ -454,7 +457,7 @@ static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return; } - if (printk_ratelimit()) + if (__ratelimit(&oom_rs)) dump_header(p, gfp_mask, order, memcg, nodemask); task_lock(p); -- cgit v0.10.2 From c3f0327f8e9d7a503f0d64573c311eddd61f197d Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:33:48 -0700 Subject: mm: add rss counters consistency check Warn about non-zero rss counters at final mmdrop. This check will prevent reoccurences of bugs such as that fixed in "mm: fix rss count leakage during migration". I didn't hide this check under CONFIG_VM_DEBUG because it rather small and rss counters cover whole page-table management, so this is a good invariant. Signed-off-by: Konstantin Khlebnikov Cc: Hugh Dickins Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/kernel/fork.c b/kernel/fork.c index c4f38a8..a9e99f3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -511,6 +511,23 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) return NULL; } +static void check_mm(struct mm_struct *mm) +{ + int i; + + for (i = 0; i < NR_MM_COUNTERS; i++) { + long x = atomic_long_read(&mm->rss_stat.count[i]); + + if (unlikely(x)) + printk(KERN_ALERT "BUG: Bad rss-counter state " + "mm:%p idx:%d val:%ld\n", mm, i, x); + } + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + VM_BUG_ON(mm->pmd_huge_pte); +#endif +} + /* * Allocate and initialize an mm_struct. */ @@ -538,9 +555,7 @@ void __mmdrop(struct mm_struct *mm) mm_free_pgd(mm); destroy_context(mm); mmu_notifier_mm_destroy(mm); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - VM_BUG_ON(mm->pmd_huge_pte); -#endif + check_mm(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); -- cgit v0.10.2 From 6131728914810a6c02e08750e13e45870101e862 Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:33:48 -0700 Subject: mm/vmscan.c: cleanup with s/reclaim_mode/isolate_mode/ With tons of reclaim_mode (defined as one field of struct scan_control) already in the file, it is clearer to rename the local reclaim_mode when setting up the isolation mode. Signed-off-by: Hillf Danton Acked-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Reviewed-by: Rik van Riel Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..61a6688 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1509,7 +1509,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, unsigned long nr_file; unsigned long nr_dirty = 0; unsigned long nr_writeback = 0; - isolate_mode_t reclaim_mode = ISOLATE_INACTIVE; + isolate_mode_t isolate_mode = ISOLATE_INACTIVE; struct zone *zone = mz->zone; while (unlikely(too_many_isolated(zone, file, sc))) { @@ -1522,20 +1522,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, set_reclaim_mode(priority, sc, false); if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) - reclaim_mode |= ISOLATE_ACTIVE; + isolate_mode |= ISOLATE_ACTIVE; lru_add_drain(); if (!sc->may_unmap) - reclaim_mode |= ISOLATE_UNMAPPED; + isolate_mode |= ISOLATE_UNMAPPED; if (!sc->may_writepage) - reclaim_mode |= ISOLATE_CLEAN; + isolate_mode |= ISOLATE_CLEAN; spin_lock_irq(&zone->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list, &nr_scanned, sc->order, - reclaim_mode, 0, file); + isolate_mode, 0, file); if (global_reclaim(sc)) { zone->pages_scanned += nr_scanned; if (current_is_kswapd()) @@ -1699,21 +1699,21 @@ static void shrink_active_list(unsigned long nr_to_scan, struct page *page; struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz); unsigned long nr_rotated = 0; - isolate_mode_t reclaim_mode = ISOLATE_ACTIVE; + isolate_mode_t isolate_mode = ISOLATE_ACTIVE; struct zone *zone = mz->zone; lru_add_drain(); if (!sc->may_unmap) - reclaim_mode |= ISOLATE_UNMAPPED; + isolate_mode |= ISOLATE_UNMAPPED; if (!sc->may_writepage) - reclaim_mode |= ISOLATE_CLEAN; + isolate_mode |= ISOLATE_CLEAN; spin_lock_irq(&zone->lru_lock); nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold, &nr_scanned, sc->order, - reclaim_mode, 1, file); + isolate_mode, 1, file); if (global_reclaim(sc)) zone->pages_scanned += nr_scanned; -- cgit v0.10.2 From 69c978232aaa99476f9bd002c2a29a84fa3779b5 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:33:49 -0700 Subject: mm: make get_mm_counter static-inline Make get_mm_counter() always static inline, it is simple enough for that. And remove unused set_mm_counter() bloat-o-meter: add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242) function old new delta try_to_unmap_one 886 952 +66 sys_remap_file_pages 1214 1230 +16 dup_mm 1684 1700 +16 do_exit 2277 2278 +1 zap_page_range 208 205 -3 unmap_region 304 296 -8 static.oom_kill_process 554 546 -8 try_to_unmap_file 1716 1700 -16 getrusage 925 909 -16 flush_old_exec 1704 1688 -16 static.dump_header 416 390 -26 acct_update_integrals 218 187 -31 do_task_stat 2986 2954 -32 get_mm_counter 34 - -34 xacct_add_tsk 371 334 -37 task_statm 172 118 -54 task_mem 383 323 -60 try_to_unmap_one() grows because update_hiwater_rss() now completely inline. Signed-off-by: Konstantin Khlebnikov Cc: KAMEZAWA Hiroyuki Acked-by: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index 17b27cd..378bcce 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1058,19 +1058,20 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, /* * per-process(per-mm_struct) statistics. */ -static inline void set_mm_counter(struct mm_struct *mm, int member, long value) -{ - atomic_long_set(&mm->rss_stat.count[member], value); -} - -#if defined(SPLIT_RSS_COUNTING) -unsigned long get_mm_counter(struct mm_struct *mm, int member); -#else static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) { - return atomic_long_read(&mm->rss_stat.count[member]); -} + long val = atomic_long_read(&mm->rss_stat.count[member]); + +#ifdef SPLIT_RSS_COUNTING + /* + * counter is updated in asynchronous manner and may go to minus. + * But it's never be expected number for users. + */ + if (val < 0) + val = 0; #endif + return (unsigned long)val; +} static inline void add_mm_counter(struct mm_struct *mm, int member, long value) { diff --git a/mm/memory.c b/mm/memory.c index e01abb9..a5de734 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -160,24 +160,6 @@ static void check_sync_rss_stat(struct task_struct *task) __sync_task_rss_stat(task, task->mm); } -unsigned long get_mm_counter(struct mm_struct *mm, int member) -{ - long val = 0; - - /* - * Don't use task->mm here...for avoiding to use task_get_mm().. - * The caller must guarantee task->mm is not invalid. - */ - val = atomic_long_read(&mm->rss_stat.count[member]); - /* - * counter is updated in asynchronous manner and may go to minus. - * But it's never be expected number for users. - */ - if (val < 0) - return 0; - return (unsigned long)val; -} - void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) { __sync_task_rss_stat(task, mm); -- cgit v0.10.2 From c38446cc65e1f2b3eb8630c53943b94c4f65f670 Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:33:50 -0700 Subject: mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone() The value of nr_reclaimed is the number of pages reclaimed in the current round of the loop, whereas nr_to_reclaim should be compared with the number of pages reclaimed in all rounds. In each round of the loop, reclaimed pages are cut off from the reclaim goal, and the loop stops once the goal achieved. Signed-off-by: Hillf Danton Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Rik van Riel Cc: Hugh Dickins Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 61a6688..8dfa598 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2112,7 +2112,12 @@ restart: * with multiple processes reclaiming pages, the total * freeing target can get unreasonably large. */ - if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY) + if (nr_reclaimed >= nr_to_reclaim) + nr_to_reclaim = 0; + else + nr_to_reclaim -= nr_reclaimed; + + if (!nr_to_reclaim && priority < DEF_PRIORITY) break; } blk_finish_plug(&plug); -- cgit v0.10.2 From 67f96aa252e606cdf6c3cf1032952ec207ec0cf0 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Wed, 21 Mar 2012 16:33:50 -0700 Subject: mm: make swapin readahead skip over holes Ever since abandoning the virtual scan of processes, for scalability reasons, swap space has been a little more fragmented than before. This can lead to the situation where a large memory user is killed, swap space ends up full of "holes" and swapin readahead is totally ineffective. On my home system, after killing a leaky firefox it took over an hour to page just under 2GB of memory back in, slowing the virtual machines down to a crawl. This patch makes swapin readahead simply skip over holes, instead of stopping at them. This allows the system to swap things back in at rates of several MB/second, instead of a few hundred kB/second. The checks done in valid_swaphandles are already done in read_swap_cache_async as well, allowing us to remove a fair amount of code. [akpm@linux-foundation.org: fix it for page_cluster >= 32] Signed-off-by: Rik van Riel Cc: Minchan Kim Cc: KOSAKI Motohiro Acked-by: Johannes Weiner Acked-by: Mel Gorman Cc: Adrian Drzewiecki Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/swap.h b/include/linux/swap.h index 3e60228..64a7dba 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -329,7 +329,6 @@ extern long total_swap_pages; extern void si_swapinfo(struct sysinfo *); extern swp_entry_t get_swap_page(void); extern swp_entry_t get_swap_page_of_type(int); -extern int valid_swaphandles(swp_entry_t, unsigned long *); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); diff --git a/mm/swap_state.c b/mm/swap_state.c index ea6b32d..9d3dd37 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -372,25 +372,23 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr) { - int nr_pages; struct page *page; - unsigned long offset; - unsigned long end_offset; + unsigned long offset = swp_offset(entry); + unsigned long start_offset, end_offset; + unsigned long mask = (1UL << page_cluster) - 1; - /* - * Get starting offset for readaround, and number of pages to read. - * Adjust starting address by readbehind (for NUMA interleave case)? - * No, it's very unlikely that swap layout would follow vma layout, - * more likely that neighbouring swap pages came from the same node: - * so use the same "addr" to choose the same node for each swap read. - */ - nr_pages = valid_swaphandles(entry, &offset); - for (end_offset = offset + nr_pages; offset < end_offset; offset++) { + /* Read a page_cluster sized and aligned cluster around offset. */ + start_offset = offset & ~mask; + end_offset = offset | mask; + if (!start_offset) /* First page is swap header. */ + start_offset++; + + for (offset = start_offset; offset <= end_offset ; offset++) { /* Ok, do the async read-ahead now */ page = read_swap_cache_async(swp_entry(swp_type(entry), offset), gfp_mask, vma, addr); if (!page) - break; + continue; page_cache_release(page); } lru_add_drain(); /* Push any new pages onto the LRU now */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 44595a3..b82c028 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2288,58 +2288,6 @@ int swapcache_prepare(swp_entry_t entry) } /* - * swap_lock prevents swap_map being freed. Don't grab an extra - * reference on the swaphandle, it doesn't matter if it becomes unused. - */ -int valid_swaphandles(swp_entry_t entry, unsigned long *offset) -{ - struct swap_info_struct *si; - int our_page_cluster = page_cluster; - pgoff_t target, toff; - pgoff_t base, end; - int nr_pages = 0; - - if (!our_page_cluster) /* no readahead */ - return 0; - - si = swap_info[swp_type(entry)]; - target = swp_offset(entry); - base = (target >> our_page_cluster) << our_page_cluster; - end = base + (1 << our_page_cluster); - if (!base) /* first page is swap header */ - base++; - - spin_lock(&swap_lock); - if (end > si->max) /* don't go beyond end of map */ - end = si->max; - - /* Count contiguous allocated slots above our target */ - for (toff = target; ++toff < end; nr_pages++) { - /* Don't read in free or bad pages */ - if (!si->swap_map[toff]) - break; - if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD) - break; - } - /* Count contiguous allocated slots below our target */ - for (toff = target; --toff >= base; nr_pages++) { - /* Don't read in free or bad pages */ - if (!si->swap_map[toff]) - break; - if (swap_count(si->swap_map[toff]) == SWAP_MAP_BAD) - break; - } - spin_unlock(&swap_lock); - - /* - * Indicate starting offset, and return number of pages to get: - * if only 1, say 0, since there's then no readahead to be done. - */ - *offset = ++toff; - return nr_pages? ++nr_pages: 0; -} - -/* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's * page of the original vmalloc'ed swap_map, to hold the continuation count -- cgit v0.10.2 From fe2c2a106663130a5ab45cb0e3414b52df2fff0c Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Wed, 21 Mar 2012 16:33:51 -0700 Subject: vmscan: reclaim at order 0 when compaction is enabled When built with CONFIG_COMPACTION, kswapd should not try to free contiguous pages, because it is not trying hard enough to have a real chance at being successful, but still disrupts the LRU enough to break other things. Do not do higher order page isolation unless we really are in lumpy reclaim mode. Stop reclaiming pages once we have enough free pages that compaction can deal with things, and we hit the normal order 0 watermarks used by kswapd. Also remove a line of code that increments balanced right before exiting the function. Signed-off-by: Rik van Riel Cc: Andrea Arcangeli Acked-by: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: KOSAKI Motohiro Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 8dfa598..d7dad2a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1138,7 +1138,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) * @mz: The mem_cgroup_zone to pull pages from. * @dst: The temp list to put pages on to. * @nr_scanned: The number of pages that were scanned. - * @order: The caller's attempted allocation order + * @sc: The scan_control struct for this reclaim session * @mode: One of the LRU isolation modes * @active: True [1] if isolating active pages * @file: True [1] if isolating file [!anon] pages @@ -1147,8 +1147,8 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) */ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, struct list_head *dst, - unsigned long *nr_scanned, int order, isolate_mode_t mode, - int active, int file) + unsigned long *nr_scanned, struct scan_control *sc, + isolate_mode_t mode, int active, int file) { struct lruvec *lruvec; struct list_head *src; @@ -1194,7 +1194,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, BUG(); } - if (!order) + if (!sc->order || !(sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)) continue; /* @@ -1208,8 +1208,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, */ zone_id = page_zone_id(page); page_pfn = page_to_pfn(page); - pfn = page_pfn & ~((1 << order) - 1); - end_pfn = pfn + (1 << order); + pfn = page_pfn & ~((1 << sc->order) - 1); + end_pfn = pfn + (1 << sc->order); for (; pfn < end_pfn; pfn++) { struct page *cursor_page; @@ -1275,7 +1275,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, *nr_scanned = scan; - trace_mm_vmscan_lru_isolate(order, + trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, @@ -1533,9 +1533,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, spin_lock_irq(&zone->lru_lock); - nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list, - &nr_scanned, sc->order, - isolate_mode, 0, file); + nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list, &nr_scanned, + sc, isolate_mode, 0, file); if (global_reclaim(sc)) { zone->pages_scanned += nr_scanned; if (current_is_kswapd()) @@ -1711,8 +1710,7 @@ static void shrink_active_list(unsigned long nr_to_scan, spin_lock_irq(&zone->lru_lock); - nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold, - &nr_scanned, sc->order, + nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold, &nr_scanned, sc, isolate_mode, 1, file); if (global_reclaim(sc)) zone->pages_scanned += nr_scanned; @@ -2758,7 +2756,7 @@ loop_again: */ for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; - int nr_slab; + int nr_slab, testorder; unsigned long balance_gap; if (!populated_zone(zone)) @@ -2791,7 +2789,20 @@ loop_again: (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / KSWAPD_ZONE_BALANCE_GAP_RATIO); - if (!zone_watermark_ok_safe(zone, order, + /* + * Kswapd reclaims only single pages with compaction + * enabled. Trying too hard to reclaim until contiguous + * free pages have become available can hurt performance + * by evicting too much useful data from memory. + * Do not reclaim more than needed for compaction. + */ + testorder = order; + if (COMPACTION_BUILD && order && + compaction_suitable(zone, order) != + COMPACT_SKIPPED) + testorder = 0; + + if (!zone_watermark_ok_safe(zone, testorder, high_wmark_pages(zone) + balance_gap, end_zone, 0)) { shrink_zone(priority, zone, &sc); @@ -2820,7 +2831,7 @@ loop_again: continue; } - if (!zone_watermark_ok_safe(zone, order, + if (!zone_watermark_ok_safe(zone, testorder, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2917,6 +2928,10 @@ out: if (zone->all_unreclaimable && priority != DEF_PRIORITY) continue; + /* Would compaction fail due to lack of free memory? */ + if (compaction_suitable(zone, order) == COMPACT_SKIPPED) + goto loop_again; + /* Confirm the zone is balanced for order-0 */ if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone), 0, 0)) { @@ -2926,8 +2941,6 @@ out: /* If balanced, clear the congested flag */ zone_clear_flag(zone, ZONE_CONGESTED); - if (i <= *classzone_idx) - balanced += zone->present_pages; } } -- cgit v0.10.2 From 7be62de99adcab4449d416977b4274985c5fe023 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Wed, 21 Mar 2012 16:33:52 -0700 Subject: vmscan: kswapd carefully call compaction With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous free pages, even when it is woken for a higher order request. This could be bad for eg. jumbo frame network allocations, which are done from interrupt context and cannot compact memory themselves. Higher than before allocation failure rates in the network receive path have been observed in kernels with compaction enabled. Teach kswapd to defragment the memory zones in a node, but only if required and compaction is not deferred in a zone. [akpm@linux-foundation.org: reduce scope of zones_need_compaction] Signed-off-by: Rik van Riel Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Minchan Kim Cc: KOSAKI Motohiro Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index bb2bbdb..7a9323a 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order); extern unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask, bool sync); +extern int compact_pgdat(pg_data_t *pgdat, int order); extern unsigned long compaction_suitable(struct zone *zone, int order); /* Do not skip compaction more than 64 times */ @@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist, return COMPACT_CONTINUE; } +static inline int compact_pgdat(pg_data_t *pgdat, int order) +{ + return COMPACT_CONTINUE; +} + static inline unsigned long compaction_suitable(struct zone *zone, int order) { return COMPACT_SKIPPED; diff --git a/mm/compaction.c b/mm/compaction.c index d9ebebe..36f0f61 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -675,44 +675,61 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist, /* Compact all zones within a node */ -static int compact_node(int nid) +static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc) { int zoneid; - pg_data_t *pgdat; struct zone *zone; - if (nid < 0 || nid >= nr_node_ids || !node_online(nid)) - return -EINVAL; - pgdat = NODE_DATA(nid); - /* Flush pending updates to the LRU lists */ lru_add_drain_all(); for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { - struct compact_control cc = { - .nr_freepages = 0, - .nr_migratepages = 0, - .order = -1, - .sync = true, - }; zone = &pgdat->node_zones[zoneid]; if (!populated_zone(zone)) continue; - cc.zone = zone; - INIT_LIST_HEAD(&cc.freepages); - INIT_LIST_HEAD(&cc.migratepages); + cc->nr_freepages = 0; + cc->nr_migratepages = 0; + cc->zone = zone; + INIT_LIST_HEAD(&cc->freepages); + INIT_LIST_HEAD(&cc->migratepages); - compact_zone(zone, &cc); + if (cc->order < 0 || !compaction_deferred(zone)) + compact_zone(zone, cc); - VM_BUG_ON(!list_empty(&cc.freepages)); - VM_BUG_ON(!list_empty(&cc.migratepages)); + VM_BUG_ON(!list_empty(&cc->freepages)); + VM_BUG_ON(!list_empty(&cc->migratepages)); } return 0; } +int compact_pgdat(pg_data_t *pgdat, int order) +{ + struct compact_control cc = { + .order = order, + .sync = false, + }; + + return __compact_pgdat(pgdat, &cc); +} + +static int compact_node(int nid) +{ + pg_data_t *pgdat; + struct compact_control cc = { + .order = -1, + .sync = true, + }; + + if (nid < 0 || nid >= nr_node_ids || !node_online(nid)) + return -EINVAL; + pgdat = NODE_DATA(nid); + + return __compact_pgdat(pgdat, &cc); +} + /* Compact all nodes in the system */ static int compact_nodes(void) { diff --git a/mm/vmscan.c b/mm/vmscan.c index d7dad2a..b2b4c4a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2919,6 +2919,8 @@ out: * and it is potentially going to sleep here. */ if (order) { + int zones_need_compaction = 1; + for (i = 0; i <= end_zone; i++) { struct zone *zone = pgdat->node_zones + i; @@ -2939,9 +2941,17 @@ out: goto loop_again; } + /* Check if the memory needs to be defragmented. */ + if (zone_watermark_ok(zone, order, + low_wmark_pages(zone), *classzone_idx, 0)) + zones_need_compaction = 0; + /* If balanced, clear the congested flag */ zone_clear_flag(zone, ZONE_CONGESTED); } + + if (zones_need_compaction) + compact_pgdat(pgdat, order); } /* -- cgit v0.10.2 From aff622495c9a0b56148192e53bdec539f5e147f2 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Wed, 21 Mar 2012 16:33:52 -0700 Subject: vmscan: only defer compaction for failed order and higher Currently a failed order-9 (transparent hugepage) compaction can lead to memory compaction being temporarily disabled for a memory zone. Even if we only need compaction for an order 2 allocation, eg. for jumbo frames networking. The fix is relatively straightforward: keep track of the highest order at which compaction is succeeding, and only defer compaction for orders at which compaction is failing. Signed-off-by: Rik van Riel Cc: Andrea Arcangeli Acked-by: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: KOSAKI Motohiro Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 7a9323a..51a90b7 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -34,20 +34,26 @@ extern unsigned long compaction_suitable(struct zone *zone, int order); * allocation success. 1 << compact_defer_limit compactions are skipped up * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT */ -static inline void defer_compaction(struct zone *zone) +static inline void defer_compaction(struct zone *zone, int order) { zone->compact_considered = 0; zone->compact_defer_shift++; + if (order < zone->compact_order_failed) + zone->compact_order_failed = order; + if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT) zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT; } /* Returns true if compaction should be skipped this time */ -static inline bool compaction_deferred(struct zone *zone) +static inline bool compaction_deferred(struct zone *zone, int order) { unsigned long defer_limit = 1UL << zone->compact_defer_shift; + if (order < zone->compact_order_failed) + return false; + /* Avoid possible overflow */ if (++zone->compact_considered > defer_limit) zone->compact_considered = defer_limit; @@ -73,11 +79,11 @@ static inline unsigned long compaction_suitable(struct zone *zone, int order) return COMPACT_SKIPPED; } -static inline void defer_compaction(struct zone *zone) +static inline void defer_compaction(struct zone *zone, int order) { } -static inline bool compaction_deferred(struct zone *zone) +static inline bool compaction_deferred(struct zone *zone, int order) { return 1; } diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 650ba2f..dff7115 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -365,6 +365,7 @@ struct zone { */ unsigned int compact_considered; unsigned int compact_defer_shift; + int compact_order_failed; #endif ZONE_PADDING(_pad1_) diff --git a/mm/compaction.c b/mm/compaction.c index 36f0f61..c4b344a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -695,9 +695,19 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc) INIT_LIST_HEAD(&cc->freepages); INIT_LIST_HEAD(&cc->migratepages); - if (cc->order < 0 || !compaction_deferred(zone)) + if (cc->order < 0 || !compaction_deferred(zone, cc->order)) compact_zone(zone, cc); + if (cc->order > 0) { + int ok = zone_watermark_ok(zone, cc->order, + low_wmark_pages(zone), 0, 0); + if (ok && cc->order > zone->compact_order_failed) + zone->compact_order_failed = cc->order + 1; + /* Currently async compaction is never deferred. */ + else if (!ok && cc->sync) + defer_compaction(zone, cc->order); + } + VM_BUG_ON(!list_empty(&cc->freepages)); VM_BUG_ON(!list_empty(&cc->migratepages)); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a13ded1..572b93e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1990,7 +1990,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, if (!order) return NULL; - if (compaction_deferred(preferred_zone)) { + if (compaction_deferred(preferred_zone, order)) { *deferred_compaction = true; return NULL; } @@ -2012,6 +2012,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, if (page) { preferred_zone->compact_considered = 0; preferred_zone->compact_defer_shift = 0; + if (order >= preferred_zone->compact_order_failed) + preferred_zone->compact_order_failed = order + 1; count_vm_event(COMPACTSUCCESS); return page; } @@ -2028,7 +2030,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, * defer if the failure was a sync compaction failure. */ if (sync_migration) - defer_compaction(preferred_zone); + defer_compaction(preferred_zone, order); cond_resched(); } diff --git a/mm/vmscan.c b/mm/vmscan.c index b2b4c4a..87e4d6a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2198,7 +2198,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) * If compaction is deferred, reclaim up to a point where * compaction will have a chance of success when re-enabled */ - if (compaction_deferred(zone)) + if (compaction_deferred(zone, sc->order)) return watermark_ok; /* If compaction is not ready to start, keep reclaiming */ -- cgit v0.10.2 From 8575ec29f61da83a2bf382c8c490499dc022101e Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:33:53 -0700 Subject: compact_pgdat: workaround lockdep warning in kswapd I get this lockdep warning from swapping load on linux-next, due to "vmscan: kswapd carefully call compaction". ================================= [ INFO: inconsistent lock state ] 3.3.0-rc2-next-20120201 #5 Not tainted --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes: (pcpu_alloc_mutex){+.+.?.}, at: [] pcpu_alloc+0x67/0x325 {RECLAIM_FS-ON-W} state was registered at: [] mark_held_locks+0xd7/0x103 [] lockdep_trace_alloc+0x85/0x9e [] __kmalloc+0x6c/0x14b [] pcpu_mem_zalloc+0x59/0x62 [] pcpu_extend_area_map+0x26/0xb1 [] pcpu_alloc+0x182/0x325 [] __alloc_percpu+0xb/0xd [] snmp_mib_init+0x1e/0x2e [] ipv4_mib_init_net+0x7a/0x184 [] ops_init.clone.0+0x6b/0x73 [] register_pernet_operations+0x61/0xa0 [] register_pernet_subsys+0x29/0x42 [] inet_init+0x1ad/0x252 [] do_one_initcall+0x7a/0x12f [] kernel_init+0x9d/0x11e [] kernel_thread_helper+0x4/0x10 irq event stamp: 656613 hardirqs last enabled at (656613): [] __mutex_unlock_slowpath+0x104/0x128 hardirqs last disabled at (656612): [] __mutex_unlock_slowpath+0x5c/0x128 softirqs last enabled at (655568): [] __do_softirq+0x120/0x136 softirqs last disabled at (654757): [] call_softirq+0x1c/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(pcpu_alloc_mutex); lock(pcpu_alloc_mutex); *** DEADLOCK *** no locks held by kswapd0/28. stack backtrace: Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5 Call Trace: [] print_usage_bug+0x1bf/0x1d0 [] ? print_irq_inversion_bug+0x1d9/0x1d9 [] mark_lock_irq+0xbb/0x22e [] ? free_hot_cold_page+0x13d/0x14f [] mark_lock+0x251/0x331 [] mark_irqflags+0x12f/0x141 [] __lock_acquire+0x58d/0x753 [] ? pcpu_alloc+0x67/0x325 [] lock_acquire+0x54/0x6a [] ? pcpu_alloc+0x67/0x325 [] ? add_preempt_count+0xa9/0xae [] mutex_lock_nested+0x5e/0x315 [] ? pcpu_alloc+0x67/0x325 [] ? __lock_acquire+0x6dc/0x753 [] ? __pagevec_release+0x2c/0x2c [] pcpu_alloc+0x67/0x325 [] ? __pagevec_release+0x2c/0x2c [] __alloc_percpu+0xb/0xd [] schedule_on_each_cpu+0x23/0x110 [] lru_add_drain_all+0x10/0x12 [] __compact_pgdat+0x20/0x182 [] compact_pgdat+0x27/0x29 [] ? zone_watermark_ok+0x1a/0x1c [] balance_pgdat+0x732/0x751 [] kswapd+0x15f/0x178 [] ? balance_pgdat+0x751/0x751 [] kthread+0x84/0x8c [] kernel_thread_helper+0x4/0x10 [] ? finish_task_switch+0x85/0xea [] ? retint_restore_args+0xe/0xe [] ? __init_kthread_worker+0x56/0x56 [] ? gs_change+0xb/0xb The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that Nick hacked into lockdep a while back: I think we're intended to read that "" in the DEADLOCK scenario as "". I'm hazy, I have not reached any conclusion as to whether it's right to complain or not; but I believe it's uneasy about kswapd now doing the mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails. Nor have I reached any conclusion as to whether it's important for kswapd to do that draining or not. But so as not to get blocked on this, with lockdep disabled from giving further reports, here's a patch which removes the lru_add_drain_all() from kswapd's callpath (and calls it only once from compact_nodes(), instead of once per node). Signed-off-by: Hugh Dickins Cc: Rik van Riel Acked-by: Mel Gorman Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/compaction.c b/mm/compaction.c index c4b344a..a08bf21 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -680,9 +680,6 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc) int zoneid; struct zone *zone; - /* Flush pending updates to the LRU lists */ - lru_add_drain_all(); - for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { zone = &pgdat->node_zones[zoneid]; @@ -727,17 +724,12 @@ int compact_pgdat(pg_data_t *pgdat, int order) static int compact_node(int nid) { - pg_data_t *pgdat; struct compact_control cc = { .order = -1, .sync = true, }; - if (nid < 0 || nid >= nr_node_ids || !node_online(nid)) - return -EINVAL; - pgdat = NODE_DATA(nid); - - return __compact_pgdat(pgdat, &cc); + return __compact_pgdat(NODE_DATA(nid), &cc); } /* Compact all nodes in the system */ @@ -745,6 +737,9 @@ static int compact_nodes(void) { int nid; + /* Flush pending updates to the LRU lists */ + lru_add_drain_all(); + for_each_online_node(nid) compact_node(nid); @@ -777,7 +772,14 @@ ssize_t sysfs_compact_node(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - compact_node(dev->id); + int nid = dev->id; + + if (nid >= 0 && nid < nr_node_ids && node_online(nid)) { + /* Flush pending updates to the LRU lists */ + lru_add_drain_all(); + + compact_node(nid); + } return count; } -- cgit v0.10.2 From aad6ec3777bf4930d4f7293745cc4c17a2d87947 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Wed, 21 Mar 2012 16:33:54 -0700 Subject: mm: compaction: make compact_control order signed "order" is -1 when compacting via /proc/sys/vm/compact_memory. Making it unsigned causes a bug in __compact_pgdat() when we test: if (cc->order < 0 || !compaction_deferred(zone, cc->order)) compact_zone(zone, cc); [akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites] Signed-off-by: Dan Carpenter Cc: Mel Gorman Cc: Minchan Kim Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/compaction.c b/mm/compaction.c index a08bf21..74a8c82 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -35,7 +35,7 @@ struct compact_control { unsigned long migrate_pfn; /* isolate_migratepages search base */ bool sync; /* Synchronous migration */ - unsigned int order; /* order a direct compactor needs */ + int order; /* order a direct compactor needs */ int migratetype; /* MOVABLE, RECLAIMABLE etc */ struct zone *zone; }; @@ -692,7 +692,7 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc) INIT_LIST_HEAD(&cc->freepages); INIT_LIST_HEAD(&cc->migratepages); - if (cc->order < 0 || !compaction_deferred(zone, cc->order)) + if (cc->order == -1 || !compaction_deferred(zone, cc->order)) compact_zone(zone, cc); if (cc->order > 0) { -- cgit v0.10.2 From 4bfc130d5afa28395288d1b57092906349604b41 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:33:54 -0700 Subject: hugetlbfs: fix hugetlb_get_unmapped_area() Use/update cached_hole_size and free_area_cache properly to speedup finding of a free region. Signed-off-by: Xiao Guangrong Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Michal Hocko Cc: Hillf Danton Cc: Andrea Arcangeli Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 1e85a7a..b7bc786 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -154,10 +154,12 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, return addr; } - start_addr = mm->free_area_cache; - - if (len <= mm->cached_hole_size) + if (len > mm->cached_hole_size) + start_addr = mm->free_area_cache; + else { start_addr = TASK_UNMAPPED_BASE; + mm->cached_hole_size = 0; + } full_search: addr = ALIGN(start_addr, huge_page_size(h)); @@ -171,13 +173,18 @@ full_search: */ if (start_addr != TASK_UNMAPPED_BASE) { start_addr = TASK_UNMAPPED_BASE; + mm->cached_hole_size = 0; goto full_search; } return -ENOMEM; } - if (!vma || addr + len <= vma->vm_start) + if (!vma || addr + len <= vma->vm_start) { + mm->free_area_cache = addr + len; return addr; + } + if (addr + mm->cached_hole_size < vma->vm_start) + mm->cached_hole_size = vma->vm_start - addr; addr = ALIGN(vma->vm_end, huge_page_size(h)); } } -- cgit v0.10.2 From cbde83e21c4fd50bfc4240408355c1e5d393063d Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:33:55 -0700 Subject: hugetlb: try to search again if it is really needed Search again only if some holes may be skipped in the first pass. [akpm@linux-foundation.org: clean up crazy compound definition] Signed-off-by: Xiao Guangrong Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Michal Hocko Cc: Hillf Danton Cc: Andrea Arcangeli Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index 8ecbb4b..c20e81c 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -309,9 +309,10 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file, struct hstate *h = hstate_file(file); struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev_vma; - unsigned long base = mm->mmap_base, addr = addr0; + unsigned long base = mm->mmap_base; + unsigned long addr = addr0; unsigned long largest_hole = mm->cached_hole_size; - int first_time = 1; + unsigned long start_addr; /* don't allow allocations above current base */ if (mm->free_area_cache > base) @@ -322,6 +323,8 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file, mm->free_area_cache = base; } try_again: + start_addr = mm->free_area_cache; + /* make sure it can fit in the remaining address space */ if (mm->free_area_cache < len) goto fail; @@ -368,10 +371,9 @@ fail: * if hint left us with no space for the requested * mapping then try again: */ - if (first_time) { + if (start_addr != base) { mm->free_area_cache = base; largest_hole = 0; - first_time = 0; goto try_again; } /* -- cgit v0.10.2 From f44d21985eb6af7361d3785e26923355172147bd Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:33:56 -0700 Subject: mm: do not reset cached_hole_size when vma is unmapped In the current code, cached_hole_size is set to the maximum value if the unmapped vma is less that free_area_cache so the next search will search from the base address. Actually, we can keep cached_hole_size so that if the next required size is more than cached_hole_size, it can search from free_area_cache. Signed-off-by: Xiao Guangrong Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Michal Hocko Cc: Hillf Danton Cc: Andrea Arcangeli Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/mmap.c b/mm/mmap.c index da15a79..4f31764 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1426,10 +1426,8 @@ void arch_unmap_area(struct mm_struct *mm, unsigned long addr) /* * Is this a new hole at the lowest possible address? */ - if (addr >= TASK_UNMAPPED_BASE && addr < mm->free_area_cache) { + if (addr >= TASK_UNMAPPED_BASE && addr < mm->free_area_cache) mm->free_area_cache = addr; - mm->cached_hole_size = ~0UL; - } } /* -- cgit v0.10.2 From b716ad953a2bc4a543143c1d9836b7007a4b182f Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:33:56 -0700 Subject: mm: search from free_area_cache for the bigger size If the required size is bigger than cached_hole_size it is better to search from free_area_cache - it is easier to get a free region, specifically for the 64 bit process whose address space is large enough Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c Signed-off-by: Xiao Guangrong Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Michal Hocko Cc: Hillf Danton Cc: Andrea Arcangeli Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 0514890..ef59642 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -195,7 +195,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; - unsigned long addr = addr0; + unsigned long addr = addr0, start_addr; /* requested length too big for entire address space */ if (len > TASK_SIZE) @@ -223,25 +223,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, mm->free_area_cache = mm->mmap_base; } +try_again: /* either no address requested or can't fit in requested address hole */ - addr = mm->free_area_cache; - - /* make sure it can fit in the remaining address space */ - if (addr > len) { - unsigned long tmp_addr = align_addr(addr - len, filp, - ALIGN_TOPDOWN); - - vma = find_vma(mm, tmp_addr); - if (!vma || tmp_addr + len <= vma->vm_start) - /* remember the address as a hint for next time */ - return mm->free_area_cache = tmp_addr; - } - - if (mm->mmap_base < len) - goto bottomup; + start_addr = addr = mm->free_area_cache; - addr = mm->mmap_base-len; + if (addr < len) + goto fail; + addr -= len; do { addr = align_addr(addr, filp, ALIGN_TOPDOWN); @@ -263,6 +252,17 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, addr = vma->vm_start-len; } while (len < vma->vm_start); +fail: + /* + * if hint left us with no space for the requested + * mapping then try again: + */ + if (start_addr != mm->mmap_base) { + mm->free_area_cache = mm->mmap_base; + mm->cached_hole_size = 0; + goto try_again; + } + bottomup: /* * A failed mmap() very likely causes application failure, diff --git a/mm/mmap.c b/mm/mmap.c index 4f31764..9e0c0de 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1442,7 +1442,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; - unsigned long addr = addr0; + unsigned long addr = addr0, start_addr; /* requested length too big for entire address space */ if (len > TASK_SIZE) @@ -1466,22 +1466,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, mm->free_area_cache = mm->mmap_base; } +try_again: /* either no address requested or can't fit in requested address hole */ - addr = mm->free_area_cache; + start_addr = addr = mm->free_area_cache; - /* make sure it can fit in the remaining address space */ - if (addr > len) { - vma = find_vma(mm, addr-len); - if (!vma || addr <= vma->vm_start) - /* remember the address as a hint for next time */ - return (mm->free_area_cache = addr-len); - } - - if (mm->mmap_base < len) - goto bottomup; - - addr = mm->mmap_base-len; + if (addr < len) + goto fail; + addr -= len; do { /* * Lookup failure means no vma is above this address, @@ -1501,7 +1493,21 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, addr = vma->vm_start-len; } while (len < vma->vm_start); -bottomup: +fail: + /* + * if hint left us with no space for the requested + * mapping then try again: + * + * Note: this is different with the case of bottomup + * which does the fully line-search, but we use find_vma + * here that causes some holes skipped. + */ + if (start_addr != mm->mmap_base) { + mm->free_area_cache = mm->mmap_base; + mm->cached_hole_size = 0; + goto try_again; + } + /* * A failed mmap() very likely causes application failure, * so fall back to the bottom-up function here. This scenario -- cgit v0.10.2 From 5aaabe831eb527e0d9284f0745d830a755f70393 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:33:57 -0700 Subject: pagemap: avoid splitting thp when reading /proc/pid/pagemap Thp split is not necessary if we explicitly check whether pmds are mapping thps or not. This patch introduces this check and adds code to generate pagemap entries for pmds mapping thps, which results in less performance impact of pagemap on thp. Signed-off-by: Naoya Horiguchi Reviewed-by: Andi Kleen Reviewed-by: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Wu Fengguang Cc: Andrea Arcangeli Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3efa725..95264c0 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -608,6 +608,9 @@ struct pagemapread { u64 *buffer; }; +#define PAGEMAP_WALK_SIZE (PMD_SIZE) +#define PAGEMAP_WALK_MASK (PMD_MASK) + #define PM_ENTRY_BYTES sizeof(u64) #define PM_STATUS_BITS 3 #define PM_STATUS_OFFSET (64 - PM_STATUS_BITS) @@ -666,6 +669,27 @@ static u64 pte_to_pagemap_entry(pte_t pte) return pme; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset) +{ + u64 pme = 0; + /* + * Currently pmd for thp is always present because thp can not be + * swapped-out, migrated, or HWPOISONed (split in such cases instead.) + * This if-check is just to prepare for future implementation. + */ + if (pmd_present(pmd)) + pme = PM_PFRAME(pmd_pfn(pmd) + offset) + | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT; + return pme; +} +#else +static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset) +{ + return 0; +} +#endif + static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -673,15 +697,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct pagemapread *pm = walk->private; pte_t *pte; int err = 0; + u64 pfn = PM_NOT_PRESENT; - split_huge_page_pmd(walk->mm, pmd); if (pmd_trans_unstable(pmd)) return 0; /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); + spin_lock(&walk->mm->page_table_lock); + if (pmd_trans_huge(*pmd)) { + if (pmd_trans_splitting(*pmd)) { + spin_unlock(&walk->mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, pmd); + } else { + for (; addr != end; addr += PAGE_SIZE) { + unsigned long offset; + + offset = (addr & ~PAGEMAP_WALK_MASK) >> + PAGE_SHIFT; + pfn = thp_pmd_to_pagemap_entry(*pmd, offset); + err = add_to_pagemap(addr, pfn, pm); + if (err) + break; + } + spin_unlock(&walk->mm->page_table_lock); + return err; + } + } else { + spin_unlock(&walk->mm->page_table_lock); + } + for (; addr != end; addr += PAGE_SIZE) { - u64 pfn = PM_NOT_PRESENT; /* check to see if we've left 'vma' behind * and need a new, higher one */ @@ -764,8 +810,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, * determine which areas of memory are actually mapped and llseek to * skip over unmapped regions. */ -#define PAGEMAP_WALK_SIZE (PMD_SIZE) -#define PAGEMAP_WALK_MASK (PMD_MASK) static ssize_t pagemap_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { -- cgit v0.10.2 From 025c5b2451e42c9e8dfdecd6dc84956ce8f321b5 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:33:57 -0700 Subject: thp: optimize away unnecessary page table locking Currently when we check if we can handle thp as it is or we need to split it into regular sized pages, we hold page table lock prior to check whether a given pmd is mapping thp or not. Because of this, when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead. To remove it, this patch introduces a optimized check function and replace several similar logics with it. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Naoya Horiguchi Cc: David Rientjes Cc: Andi Kleen Cc: Wu Fengguang Cc: Andrea Arcangeli Cc: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 95264c0..328843d 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -394,20 +394,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t *pte; spinlock_t *ptl; - spin_lock(&walk->mm->page_table_lock); - if (pmd_trans_huge(*pmd)) { - if (pmd_trans_splitting(*pmd)) { - spin_unlock(&walk->mm->page_table_lock); - wait_split_huge_page(vma->anon_vma, pmd); - } else { - smaps_pte_entry(*(pte_t *)pmd, addr, - HPAGE_PMD_SIZE, walk); - spin_unlock(&walk->mm->page_table_lock); - mss->anonymous_thp += HPAGE_PMD_SIZE; - return 0; - } - } else { + if (pmd_trans_huge_lock(pmd, vma) == 1) { + smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk); spin_unlock(&walk->mm->page_table_lock); + mss->anonymous_thp += HPAGE_PMD_SIZE; + return 0; } if (pmd_trans_unstable(pmd)) @@ -705,26 +696,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, /* find the first VMA at or above 'addr' */ vma = find_vma(walk->mm, addr); spin_lock(&walk->mm->page_table_lock); - if (pmd_trans_huge(*pmd)) { - if (pmd_trans_splitting(*pmd)) { - spin_unlock(&walk->mm->page_table_lock); - wait_split_huge_page(vma->anon_vma, pmd); - } else { - for (; addr != end; addr += PAGE_SIZE) { - unsigned long offset; - - offset = (addr & ~PAGEMAP_WALK_MASK) >> - PAGE_SHIFT; - pfn = thp_pmd_to_pagemap_entry(*pmd, offset); - err = add_to_pagemap(addr, pfn, pm); - if (err) - break; - } - spin_unlock(&walk->mm->page_table_lock); - return err; + if (pmd_trans_huge_lock(pmd, vma) == 1) { + for (; addr != end; addr += PAGE_SIZE) { + unsigned long offset; + + offset = (addr & ~PAGEMAP_WALK_MASK) >> + PAGE_SHIFT; + pfn = thp_pmd_to_pagemap_entry(*pmd, offset); + err = add_to_pagemap(addr, pfn, pm); + if (err) + break; } - } else { spin_unlock(&walk->mm->page_table_lock); + return err; } for (; addr != end; addr += PAGE_SIZE) { @@ -992,24 +976,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, pte_t *pte; md = walk->private; - spin_lock(&walk->mm->page_table_lock); - if (pmd_trans_huge(*pmd)) { - if (pmd_trans_splitting(*pmd)) { - spin_unlock(&walk->mm->page_table_lock); - wait_split_huge_page(md->vma->anon_vma, pmd); - } else { - pte_t huge_pte = *(pte_t *)pmd; - struct page *page; - - page = can_gather_numa_stats(huge_pte, md->vma, addr); - if (page) - gather_stats(page, md, pte_dirty(huge_pte), - HPAGE_PMD_SIZE/PAGE_SIZE); - spin_unlock(&walk->mm->page_table_lock); - return 0; - } - } else { + + if (pmd_trans_huge_lock(pmd, md->vma) == 1) { + pte_t huge_pte = *(pte_t *)pmd; + struct page *page; + + page = can_gather_numa_stats(huge_pte, md->vma, addr); + if (page) + gather_stats(page, md, pte_dirty(huge_pte), + HPAGE_PMD_SIZE/PAGE_SIZE); spin_unlock(&walk->mm->page_table_lock); + return 0; } if (pmd_trans_unstable(pmd)) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 1b92129..f56cacb 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -113,6 +113,18 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, unsigned long end, long adjust_next); +extern int __pmd_trans_huge_lock(pmd_t *pmd, + struct vm_area_struct *vma); +/* mmap_sem must be held on entry */ +static inline int pmd_trans_huge_lock(pmd_t *pmd, + struct vm_area_struct *vma) +{ + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem)); + if (pmd_trans_huge(*pmd)) + return __pmd_trans_huge_lock(pmd, vma); + else + return 0; +} static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start, unsigned long end, @@ -176,6 +188,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, long adjust_next) { } +static inline int pmd_trans_huge_lock(pmd_t *pmd, + struct vm_area_struct *vma) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8f7fc39..f0e5306 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1031,32 +1031,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, { int ret = 0; - spin_lock(&tlb->mm->page_table_lock); - if (likely(pmd_trans_huge(*pmd))) { - if (unlikely(pmd_trans_splitting(*pmd))) { - spin_unlock(&tlb->mm->page_table_lock); - wait_split_huge_page(vma->anon_vma, - pmd); - } else { - struct page *page; - pgtable_t pgtable; - pgtable = get_pmd_huge_pte(tlb->mm); - page = pmd_page(*pmd); - pmd_clear(pmd); - tlb_remove_pmd_tlb_entry(tlb, pmd, addr); - page_remove_rmap(page); - VM_BUG_ON(page_mapcount(page) < 0); - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); - VM_BUG_ON(!PageHead(page)); - tlb->mm->nr_ptes--; - spin_unlock(&tlb->mm->page_table_lock); - tlb_remove_page(tlb, page); - pte_free(tlb->mm, pgtable); - ret = 1; - } - } else + if (__pmd_trans_huge_lock(pmd, vma) == 1) { + struct page *page; + pgtable_t pgtable; + pgtable = get_pmd_huge_pte(tlb->mm); + page = pmd_page(*pmd); + pmd_clear(pmd); + tlb_remove_pmd_tlb_entry(tlb, pmd, addr); + page_remove_rmap(page); + VM_BUG_ON(page_mapcount(page) < 0); + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); + VM_BUG_ON(!PageHead(page)); + tlb->mm->nr_ptes--; spin_unlock(&tlb->mm->page_table_lock); - + tlb_remove_page(tlb, page); + pte_free(tlb->mm, pgtable); + ret = 1; + } return ret; } @@ -1066,21 +1057,15 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, { int ret = 0; - spin_lock(&vma->vm_mm->page_table_lock); - if (likely(pmd_trans_huge(*pmd))) { - ret = !pmd_trans_splitting(*pmd); - spin_unlock(&vma->vm_mm->page_table_lock); - if (unlikely(!ret)) - wait_split_huge_page(vma->anon_vma, pmd); - else { - /* - * All logical pages in the range are present - * if backed by a huge page. - */ - memset(vec, 1, (end - addr) >> PAGE_SHIFT); - } - } else + if (__pmd_trans_huge_lock(pmd, vma) == 1) { + /* + * All logical pages in the range are present + * if backed by a huge page. + */ spin_unlock(&vma->vm_mm->page_table_lock); + memset(vec, 1, (end - addr) >> PAGE_SHIFT); + ret = 1; + } return ret; } @@ -1110,20 +1095,11 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, goto out; } - spin_lock(&mm->page_table_lock); - if (likely(pmd_trans_huge(*old_pmd))) { - if (pmd_trans_splitting(*old_pmd)) { - spin_unlock(&mm->page_table_lock); - wait_split_huge_page(vma->anon_vma, old_pmd); - ret = -1; - } else { - pmd = pmdp_get_and_clear(mm, old_addr, old_pmd); - VM_BUG_ON(!pmd_none(*new_pmd)); - set_pmd_at(mm, new_addr, new_pmd, pmd); - spin_unlock(&mm->page_table_lock); - ret = 1; - } - } else { + ret = __pmd_trans_huge_lock(old_pmd, vma); + if (ret == 1) { + pmd = pmdp_get_and_clear(mm, old_addr, old_pmd); + VM_BUG_ON(!pmd_none(*new_pmd)); + set_pmd_at(mm, new_addr, new_pmd, pmd); spin_unlock(&mm->page_table_lock); } out: @@ -1136,24 +1112,41 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, struct mm_struct *mm = vma->vm_mm; int ret = 0; - spin_lock(&mm->page_table_lock); + if (__pmd_trans_huge_lock(pmd, vma) == 1) { + pmd_t entry; + entry = pmdp_get_and_clear(mm, addr, pmd); + entry = pmd_modify(entry, newprot); + set_pmd_at(mm, addr, pmd, entry); + spin_unlock(&vma->vm_mm->page_table_lock); + ret = 1; + } + + return ret; +} + +/* + * Returns 1 if a given pmd maps a stable (not under splitting) thp. + * Returns -1 if it maps a thp under splitting. Returns 0 otherwise. + * + * Note that if it returns 1, this routine returns without unlocking page + * table locks. So callers must unlock them. + */ +int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) +{ + spin_lock(&vma->vm_mm->page_table_lock); if (likely(pmd_trans_huge(*pmd))) { if (unlikely(pmd_trans_splitting(*pmd))) { - spin_unlock(&mm->page_table_lock); + spin_unlock(&vma->vm_mm->page_table_lock); wait_split_huge_page(vma->anon_vma, pmd); + return -1; } else { - pmd_t entry; - - entry = pmdp_get_and_clear(mm, addr, pmd); - entry = pmd_modify(entry, newprot); - set_pmd_at(mm, addr, pmd, entry); - spin_unlock(&vma->vm_mm->page_table_lock); - ret = 1; + /* Thp mapped by 'pmd' is stable, so we can + * handle it as it is. */ + return 1; } - } else - spin_unlock(&vma->vm_mm->page_table_lock); - - return ret; + } + spin_unlock(&vma->vm_mm->page_table_lock); + return 0; } pmd_t *page_check_address_pmd(struct page *page, -- cgit v0.10.2 From e873c49fbfdd595481976b915850e682441bcbec Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:33:58 -0700 Subject: pagemap: export KPF_THP This flag shows that a given page is a subpage of a transparent hugepage. It helps us debug and test the kernel by showing physical address of thp. Signed-off-by: Naoya Horiguchi Reviewed-by: Wu Fengguang Reviewed-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Cc: David Rientjes Cc: Andi Kleen Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/proc/page.c b/fs/proc/page.c index 6d8e6a9..7fcd0d6 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -115,6 +115,8 @@ u64 stable_page_flags(struct page *page) u |= 1 << KPF_COMPOUND_TAIL; if (PageHuge(page)) u |= 1 << KPF_HUGE; + else if (PageTransCompound(page)) + u |= 1 << KPF_THP; /* * Caveats on high order pages: page->_count will only be set diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h index bd92a89..26a6571 100644 --- a/include/linux/kernel-page-flags.h +++ b/include/linux/kernel-page-flags.h @@ -30,6 +30,7 @@ #define KPF_NOPAGE 20 #define KPF_KSM 21 +#define KPF_THP 22 /* kernel hacking assistances * WARNING: subject to change, never rely on them! -- cgit v0.10.2 From 807f0ccfe15551afd514c062585045c88ca62037 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:33:58 -0700 Subject: pagemap: document KPF_THP and make page-types aware of it page-types, which is a common user of pagemap, gets aware of thp with this patch. This helps system admins and kernel hackers know about how thp works. Here is a sample output of page-types over a thp: $ page-types -p --raw --list voffset offset len flags ... 7f9d40200 3f8400 1 ___U_lA____Ma_bH______t____________ 7f9d40201 3f8401 1ff ________________T_____t____________ flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000410000 511 1 ________________T_____t____________ compound_tail,thp 0x000000000040d868 1 0 ___U_lA____Ma_bH______t____________ uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp Signed-off-by: Naoya Horiguchi Acked-by: Wu Fengguang Reviewed-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro Cc: David Rientjes Cc: Andi Kleen Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 7445caa..0b13f02 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -98,6 +98,7 @@ #define KPF_HWPOISON 19 #define KPF_NOPAGE 20 #define KPF_KSM 21 +#define KPF_THP 22 /* [32-] kernel hacking assistances */ #define KPF_RESERVED 32 @@ -147,6 +148,7 @@ static const char *page_flag_names[] = { [KPF_HWPOISON] = "X:hwpoison", [KPF_NOPAGE] = "n:nopage", [KPF_KSM] = "x:ksm", + [KPF_THP] = "t:thp", [KPF_RESERVED] = "r:reserved", [KPF_MLOCKED] = "m:mlocked", diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index df09b96..4600cbe 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -60,6 +60,7 @@ There are three components to pagemap: 19. HWPOISON 20. NOPAGE 21. KSM + 22. THP Short descriptions to the page flags: @@ -97,6 +98,9 @@ Short descriptions to the page flags: 21. KSM identical memory pages dynamically shared between one or more processes +22. THP + contiguous pages which construct transparent hugepages + [IO related page flags] 1. ERROR IO error occurred 3. UPTODATE page has up-to-date data -- cgit v0.10.2 From 092b50bacd1cdbffef2643b7a46f2a215407919c Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:33:59 -0700 Subject: pagemap: introduce data structure for pagemap entry Currently a local variable of pagemap entry in pagemap_pte_range() is named pfn and typed with u64, but it's not correct (pfn should be unsigned long.) This patch introduces special type for pagemap entries and replaces code with it. Signed-off-by: Naoya Horiguchi Cc: David Rientjes Cc: Andi Kleen Cc: Wu Fengguang Cc: Andrea Arcangeli Cc: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 328843d..c7e3a16 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -594,9 +594,13 @@ const struct file_operations proc_clear_refs_operations = { .llseek = noop_llseek, }; +typedef struct { + u64 pme; +} pagemap_entry_t; + struct pagemapread { int pos, len; - u64 *buffer; + pagemap_entry_t *buffer; }; #define PAGEMAP_WALK_SIZE (PMD_SIZE) @@ -619,10 +623,15 @@ struct pagemapread { #define PM_NOT_PRESENT PM_PSHIFT(PAGE_SHIFT) #define PM_END_OF_BUFFER 1 -static int add_to_pagemap(unsigned long addr, u64 pfn, +static inline pagemap_entry_t make_pme(u64 val) +{ + return (pagemap_entry_t) { .pme = val }; +} + +static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme, struct pagemapread *pm) { - pm->buffer[pm->pos++] = pfn; + pm->buffer[pm->pos++] = *pme; if (pm->pos >= pm->len) return PM_END_OF_BUFFER; return 0; @@ -634,8 +643,10 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end, struct pagemapread *pm = walk->private; unsigned long addr; int err = 0; + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT); + for (addr = start; addr < end; addr += PAGE_SIZE) { - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm); + err = add_to_pagemap(addr, &pme, pm); if (err) break; } @@ -648,36 +659,33 @@ static u64 swap_pte_to_pagemap_entry(pte_t pte) return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT); } -static u64 pte_to_pagemap_entry(pte_t pte) +static void pte_to_pagemap_entry(pagemap_entry_t *pme, pte_t pte) { - u64 pme = 0; if (is_swap_pte(pte)) - pme = PM_PFRAME(swap_pte_to_pagemap_entry(pte)) - | PM_PSHIFT(PAGE_SHIFT) | PM_SWAP; + *pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte)) + | PM_PSHIFT(PAGE_SHIFT) | PM_SWAP); else if (pte_present(pte)) - pme = PM_PFRAME(pte_pfn(pte)) - | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT; - return pme; + *pme = make_pme(PM_PFRAME(pte_pfn(pte)) + | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset) +static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, + pmd_t pmd, int offset) { - u64 pme = 0; /* * Currently pmd for thp is always present because thp can not be * swapped-out, migrated, or HWPOISONed (split in such cases instead.) * This if-check is just to prepare for future implementation. */ if (pmd_present(pmd)) - pme = PM_PFRAME(pmd_pfn(pmd) + offset) - | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT; - return pme; + *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) + | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT); } #else -static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset) +static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, + pmd_t pmd, int offset) { - return 0; } #endif @@ -688,7 +696,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct pagemapread *pm = walk->private; pte_t *pte; int err = 0; - u64 pfn = PM_NOT_PRESENT; + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT); if (pmd_trans_unstable(pmd)) return 0; @@ -702,8 +710,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; - pfn = thp_pmd_to_pagemap_entry(*pmd, offset); - err = add_to_pagemap(addr, pfn, pm); + thp_pmd_to_pagemap_entry(&pme, *pmd, offset); + err = add_to_pagemap(addr, &pme, pm); if (err) break; } @@ -723,11 +731,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (vma && (vma->vm_start <= addr) && !is_vm_hugetlb_page(vma)) { pte = pte_offset_map(pmd, addr); - pfn = pte_to_pagemap_entry(*pte); + pte_to_pagemap_entry(&pme, *pte); /* unmap before userspace copy */ pte_unmap(pte); } - err = add_to_pagemap(addr, pfn, pm); + err = add_to_pagemap(addr, &pme, pm); if (err) return err; } @@ -738,13 +746,12 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } #ifdef CONFIG_HUGETLB_PAGE -static u64 huge_pte_to_pagemap_entry(pte_t pte, int offset) +static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, + pte_t pte, int offset) { - u64 pme = 0; if (pte_present(pte)) - pme = PM_PFRAME(pte_pfn(pte) + offset) - | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT; - return pme; + *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) + | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT); } /* This function walks within one hugetlb entry in the single call */ @@ -754,12 +761,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, { struct pagemapread *pm = walk->private; int err = 0; - u64 pfn; + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT); for (; addr != end; addr += PAGE_SIZE) { int offset = (addr & ~hmask) >> PAGE_SHIFT; - pfn = huge_pte_to_pagemap_entry(*pte, offset); - err = add_to_pagemap(addr, pfn, pm); + huge_pte_to_pagemap_entry(&pme, *pte, offset); + err = add_to_pagemap(addr, &pme, pm); if (err) return err; } -- cgit v0.10.2 From ce1744f4ed20ca873360e54502f8a71564ef7cc6 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:33:59 -0700 Subject: mm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION) Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(), IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method for checking config options in C expressions. Signed-off-by: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 05ed282..855c337 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -8,7 +8,6 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **); #ifdef CONFIG_MIGRATION -#define PAGE_MIGRATION 1 extern void putback_lru_pages(struct list_head *l); extern int migrate_page(struct address_space *, @@ -32,7 +31,6 @@ extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); #else -#define PAGE_MIGRATION 0 static inline void putback_lru_pages(struct list_head *l) {} static inline int migrate_pages(struct list_head *l, new_page_t x, diff --git a/mm/mprotect.c b/mm/mprotect.c index f437d05..c621e99 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -60,7 +60,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, ptent = pte_mkwrite(ptent); ptep_modify_prot_commit(mm, addr, pte, ptent); - } else if (PAGE_MIGRATION && !pte_file(oldpte)) { + } else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); if (is_write_migration_entry(entry)) { diff --git a/mm/rmap.c b/mm/rmap.c index c8454e0..78cc46b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1282,7 +1282,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } dec_mm_counter(mm, MM_ANONPAGES); inc_mm_counter(mm, MM_SWAPENTS); - } else if (PAGE_MIGRATION) { + } else if (IS_ENABLED(CONFIG_MIGRATION)) { /* * Store the pfn of the page in a special migration * pte. do_swap_page() will wait until the migration @@ -1293,7 +1293,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); - } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) { + } else if (IS_ENABLED(CONFIG_MIGRATION) && + (TTU_ACTION(flags) == TTU_MIGRATION)) { /* Establish migration entry for a file page */ swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); @@ -1499,7 +1500,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) * locking requirements of exec(), migration skips * temporary VMAs until after exec() completes. */ - if (PAGE_MIGRATION && (flags & TTU_MIGRATION) && + if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) && is_vma_temporary_stack(vma)) continue; -- cgit v0.10.2 From cc715d99e529d470dde2f33a6614f255adea71f3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 21 Mar 2012 16:34:00 -0700 Subject: mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem Stuart Foster reported on bugzilla that copying large amounts of data from NTFS caused an OOM kill on 32-bit X86 with 16G of memory. Andrew Morton correctly identified that the problem was NTFS was using 512 blocks meaning each page had 8 buffer_heads in low memory pinning it. In the past, direct reclaim used to scan highmem even if the allocating process did not specify __GFP_HIGHMEM but not any more. kswapd no longer will reclaim from zones that are above the high watermark. The intention in both cases was to minimise unnecessary reclaim. The downside is on machines with large amounts of highmem that lowmem can be fully consumed by buffer_heads with nothing trying to free them. The following patch is based on a suggestion by Andrew Morton to extend the buffer_heads_over_limit case to force kswapd and direct reclaim to scan the highmem zone regardless of the allocation request or watermarks. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578 [hughd@google.com: move buffer_heads_over_limit check up] [akpm@linux-foundation.org: buffer_heads_over_limit is unlikely] Reported-by: Stuart Foster Tested-by: Stuart Foster Signed-off-by: Mel Gorman Signed-off-by: Hugh Dickins Cc: Johannes Weiner Cc: Rik van Riel Cc: Christoph Lameter Cc: stable Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 87e4d6a..ae3bf0a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1642,18 +1642,6 @@ static void move_active_pages_to_lru(struct zone *zone, unsigned long pgmoved = 0; struct page *page; - if (buffer_heads_over_limit) { - spin_unlock_irq(&zone->lru_lock); - list_for_each_entry(page, list, lru) { - if (page_has_private(page) && trylock_page(page)) { - if (page_has_private(page)) - try_to_release_page(page, 0); - unlock_page(page); - } - } - spin_lock_irq(&zone->lru_lock); - } - while (!list_empty(list)) { struct lruvec *lruvec; @@ -1735,6 +1723,14 @@ static void shrink_active_list(unsigned long nr_to_scan, continue; } + if (unlikely(buffer_heads_over_limit)) { + if (page_has_private(page) && trylock_page(page)) { + if (page_has_private(page)) + try_to_release_page(page, 0); + unlock_page(page); + } + } + if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) { nr_rotated += hpage_nr_pages(page); /* @@ -2238,6 +2234,14 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, unsigned long nr_soft_scanned; bool aborted_reclaim = false; + /* + * If the number of buffer_heads in the machine exceeds the maximum + * allowed level, force direct reclaim to scan the highmem zone as + * highmem pages could be pinning lowmem pages storing buffer_heads + */ + if (buffer_heads_over_limit) + sc->gfp_mask |= __GFP_HIGHMEM; + for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(sc->gfp_mask), sc->nodemask) { if (!populated_zone(zone)) @@ -2727,6 +2731,17 @@ loop_again: */ age_active_anon(zone, &sc, priority); + /* + * If the number of buffer_heads in the machine + * exceeds the maximum allowed level and this node + * has a highmem zone, force kswapd to reclaim from + * it to relieve lowmem pressure. + */ + if (buffer_heads_over_limit && is_highmem_idx(i)) { + end_zone = i; + break; + } + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2802,7 +2817,8 @@ loop_again: COMPACT_SKIPPED) testorder = 0; - if (!zone_watermark_ok_safe(zone, testorder, + if ((buffer_heads_over_limit && is_highmem_idx(i)) || + !zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) + balance_gap, end_zone, 0)) { shrink_zone(priority, zone, &sc); -- cgit v0.10.2 From 28073b02bfaaed1e3278acfb8e6e7c9f76d9f2b6 Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:34:00 -0700 Subject: mm: hugetlb: defer freeing pages when gathering surplus pages When gathering surplus pages, the number of needed pages is recomputed after reacquiring hugetlb lock to catch changes in resv_huge_pages and free_huge_pages. Plus it is recomputed with the number of newly allocated pages involved. Thus freeing pages can be deferred a bit to see if the final page request is satisfied, though pages could be allocated less than needed. Signed-off-by: Hillf Danton Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Cc: Mel Gorman Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a876871..afe3e1f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -852,6 +852,7 @@ static int gather_surplus_pages(struct hstate *h, int delta) struct page *page, *tmp; int ret, i; int needed, allocated; + bool alloc_ok = true; needed = (h->resv_huge_pages + delta) - h->free_huge_pages; if (needed <= 0) { @@ -867,17 +868,13 @@ retry: spin_unlock(&hugetlb_lock); for (i = 0; i < needed; i++) { page = alloc_buddy_huge_page(h, NUMA_NO_NODE); - if (!page) - /* - * We were not able to allocate enough pages to - * satisfy the entire reservation so we free what - * we've allocated so far. - */ - goto free; - + if (!page) { + alloc_ok = false; + break; + } list_add(&page->lru, &surplus_list); } - allocated += needed; + allocated += i; /* * After retaking hugetlb_lock, we need to recalculate 'needed' @@ -886,9 +883,16 @@ retry: spin_lock(&hugetlb_lock); needed = (h->resv_huge_pages + delta) - (h->free_huge_pages + allocated); - if (needed > 0) - goto retry; - + if (needed > 0) { + if (alloc_ok) + goto retry; + /* + * We were not able to allocate enough pages to + * satisfy the entire reservation so we free what + * we've allocated so far. + */ + goto free; + } /* * The surplus_list now contains _at_least_ the number of extra pages * needed to accommodate the reservation. Add the appropriate number @@ -914,10 +918,10 @@ retry: VM_BUG_ON(page_count(page)); enqueue_huge_page(h, page); } +free: spin_unlock(&hugetlb_lock); /* Free unnecessary surplus pages to the buddy allocator */ -free: if (!list_empty(&surplus_list)) { list_for_each_entry_safe(page, tmp, &surplus_list, lru) { list_del(&page->lru); -- cgit v0.10.2 From 6583a84304e5a6cb27ec2aace79ef9c4ba888d12 Mon Sep 17 00:00:00 2001 From: Kautuk Consul Date: Wed, 21 Mar 2012 16:34:01 -0700 Subject: rmap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link Reduce code duplication by calling anon_vma_chain_link() from anon_vma_prepare(). Also move anon_vmal_chain_link() to a more suitable location in the file. Signed-off-by: Kautuk Consul Acked-by: Peter Zijlstra Cc: Hugh Dickins Reviewed-by: KAMEZWA Hiroyuki Cc: Mel Gorman Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/rmap.c b/mm/rmap.c index 78cc46b..ebeb95e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -120,6 +120,21 @@ static void anon_vma_chain_free(struct anon_vma_chain *anon_vma_chain) kmem_cache_free(anon_vma_chain_cachep, anon_vma_chain); } +static void anon_vma_chain_link(struct vm_area_struct *vma, + struct anon_vma_chain *avc, + struct anon_vma *anon_vma) +{ + avc->vma = vma; + avc->anon_vma = anon_vma; + list_add(&avc->same_vma, &vma->anon_vma_chain); + + /* + * It's critical to add new vmas to the tail of the anon_vma, + * see comment in huge_memory.c:__split_huge_page(). + */ + list_add_tail(&avc->same_anon_vma, &anon_vma->head); +} + /** * anon_vma_prepare - attach an anon_vma to a memory region * @vma: the memory region in question @@ -175,10 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; - avc->anon_vma = anon_vma; - avc->vma = vma; - list_add(&avc->same_vma, &vma->anon_vma_chain); - list_add_tail(&avc->same_anon_vma, &anon_vma->head); + anon_vma_chain_link(vma, avc, anon_vma); allocated = NULL; avc = NULL; } @@ -224,21 +236,6 @@ static inline void unlock_anon_vma_root(struct anon_vma *root) mutex_unlock(&root->mutex); } -static void anon_vma_chain_link(struct vm_area_struct *vma, - struct anon_vma_chain *avc, - struct anon_vma *anon_vma) -{ - avc->vma = vma; - avc->anon_vma = anon_vma; - list_add(&avc->same_vma, &vma->anon_vma_chain); - - /* - * It's critical to add new vmas to the tail of the anon_vma, - * see comment in huge_memory.c:__split_huge_page(). - */ - list_add_tail(&avc->same_anon_vma, &anon_vma->head); -} - /* * Attach the anon_vmas from src to dst. * Returns 0 on success, -ENOMEM on failure. -- cgit v0.10.2 From 978ea78b65794ef07eb66b9946064dea66b52554 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:34:01 -0700 Subject: rmap: remove __anon_vma_link() declaration This declaration is not used anymore, remove it. Signed-off-by: Xiao Guangrong Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 1cdd62a..fd07c45 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -122,7 +122,6 @@ void unlink_anon_vmas(struct vm_area_struct *); int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); void anon_vma_moveto_tail(struct vm_area_struct *); int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); -void __anon_vma_link(struct vm_area_struct *); static inline void anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next) -- cgit v0.10.2 From d563c0501bf8702b9b683206c09b9defb37d8a8a Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:34:02 -0700 Subject: vmscan: handle isolated pages with lru lock released When shrinking inactive lru list, isolated pages are queued on locally private list, so the lock-hold time could be reduced if pages are counted without lock protection. To achieve that, firstly updating reclaim stat is delayed until the putback stage, after reacquiring the lru lock. Secondly, operations related to vm and zone stats are now proteced with preemption disabled as they are per-cpu operations. Signed-off-by: Hillf Danton Acked-by: Hugh Dickins Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index ae3bf0a..57d8ef6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1413,7 +1413,6 @@ update_isolated_counts(struct mem_cgroup_zone *mz, unsigned long *nr_anon, unsigned long *nr_file) { - struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz); struct zone *zone = mz->zone; unsigned int count[NR_LRU_LISTS] = { 0, }; unsigned long nr_active = 0; @@ -1434,6 +1433,7 @@ update_isolated_counts(struct mem_cgroup_zone *mz, count[lru] += numpages; } + preempt_disable(); __count_vm_events(PGDEACTIVATE, nr_active); __mod_zone_page_state(zone, NR_ACTIVE_FILE, @@ -1448,8 +1448,9 @@ update_isolated_counts(struct mem_cgroup_zone *mz, *nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON]; *nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE]; - reclaim_stat->recent_scanned[0] += *nr_anon; - reclaim_stat->recent_scanned[1] += *nr_file; + __mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon); + __mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file); + preempt_enable(); } /* @@ -1511,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, unsigned long nr_writeback = 0; isolate_mode_t isolate_mode = ISOLATE_INACTIVE; struct zone *zone = mz->zone; + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz); while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1544,19 +1546,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned); } + spin_unlock_irq(&zone->lru_lock); - if (nr_taken == 0) { - spin_unlock_irq(&zone->lru_lock); + if (nr_taken == 0) return 0; - } update_isolated_counts(mz, &page_list, &nr_anon, &nr_file); - __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon); - __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file); - - spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority, &nr_dirty, &nr_writeback); @@ -1569,6 +1565,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz, spin_lock_irq(&zone->lru_lock); + reclaim_stat->recent_scanned[0] += nr_anon; + reclaim_stat->recent_scanned[1] += nr_file; + if (current_is_kswapd()) __count_vm_events(KSWAPD_STEAL, nr_reclaimed); __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed); -- cgit v0.10.2 From fcf4d8212a8f38334679e82ff14532b908b4b451 Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Wed, 21 Mar 2012 16:34:02 -0700 Subject: thp: documentation: 'transparent_hugepage=' can also be specified on cmdline The behavior of THP can either be toggled through sysfs in runtime or using a kernel cmdline parameter 'transparent_hugepage='. Document the latter in kernel-parameters.txt Signed-off-by: Jiri Kosina Cc: Andrea Arcangeli Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 8cadb75..7986d79 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2635,6 +2635,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. to facilitate early boot debugging. See also Documentation/trace/events.txt + transparent_hugepage= + [KNL] + Format: [always|madvise|never] + Can be used to control the default behavior of the system + with respect to transparent hugepages. + See Documentation/vm/transhuge.txt for more details. + tsc= Disable clocksource stability checks for TSC. Format: [x86] reliable: mark tsc clocksource as reliable, this -- cgit v0.10.2 From 9e81130b7ce23050335b1197bb51743517b5b9d0 Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:34:03 -0700 Subject: mm: hugetlb: bail out unmapping after serving reference page When unmapping a given VM range, we could bail out if a reference page is supplied and is unmapped, which is a minor optimization. Signed-off-by: Hillf Danton Cc: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/hugetlb.c b/mm/hugetlb.c index afe3e1f..62f9fad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2280,6 +2280,10 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, if (pte_dirty(pte)) set_page_dirty(page); list_add(&page->lru, &page_list); + + /* Bail out after unmapping reference page if supplied */ + if (ref_page) + break; } flush_tlb_range(vma, start, end); spin_unlock(&mm->page_table_lock); -- cgit v0.10.2 From b76437579d1344b612cf1851ae610c636cec7db0 Mon Sep 17 00:00:00 2001 From: Siddhesh Poyarekar Date: Wed, 21 Mar 2012 16:34:04 -0700 Subject: procfs: mark thread stack correctly in proc//maps Stack for a new thread is mapped by userspace code and passed via sys_clone. This memory is currently seen as anonymous in /proc//maps, which makes it difficult to ascertain which mappings are being used for thread stacks. This patch uses the individual task stack pointers to determine which vmas are actually thread stacks. For a multithreaded program like the following: #include void *thread_main(void *foo) { while(1); } int main() { pthread_t t; pthread_create(&t, NULL, thread_main, NULL); pthread_join(t, NULL); } proc/PID/maps looks like the following: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but that is not always a reliable way to find out which vma is a thread stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same content. With this patch in place, /proc/PID/task/TID/maps are treated as 'maps as the task would see it' and hence, only the vma that that task uses as stack is marked as [stack]. All other 'stack' vmas are marked as anonymous memory. /proc/PID/maps acts as a thread group level view, where all thread stack vmas are marked as [stack:TID] where TID is the process ID of the task that uses that vma as stack, while the process stack is marked as [stack]. So /proc/PID/maps will look like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Thus marking all vmas that are used as stacks by the threads in the thread group along with the process stack. The task level maps will however like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] where only the vma that is being used as a stack by *that* task is marked as [stack]. Analogous changes have been made to /proc/PID/smaps, /proc/PID/numa_maps, /proc/PID/task/TID/smaps and /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and numa_maps: [siddhesh@localhost ~ ]$ pgrep a.out 1441 [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack" 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack" 7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2 7fff6273a000 default stack anon=3 dirty=3 N0=3 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack" 7f8a44492000 default stack anon=2 dirty=2 N0=2 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack" 7fff6273a000 default stack anon=3 dirty=3 N0=3 [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix build] Signed-off-by: Siddhesh Poyarekar Cc: KOSAKI Motohiro Cc: Alexander Viro Cc: Jamie Lokier Cc: Mike Frysinger Cc: Alexey Dobriyan Cc: Matt Mackall Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index a76a26a..b7413cb 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -290,7 +290,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) rsslim current limit in bytes on the rss start_code address above which program text can run end_code address below which program text can run - start_stack address of the start of the stack + start_stack address of the start of the main process stack esp current value of ESP eip current value of EIP pending bitmap of pending signals @@ -325,7 +325,7 @@ address perms offset dev inode pathname a7cb1000-a7cb2000 ---p 00000000 00:00 0 a7cb2000-a7eb2000 rw-p 00000000 00:00 0 a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001] a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 @@ -357,11 +357,39 @@ is not associated with a file: [heap] = the heap of the program [stack] = the stack of the main process + [stack:1001] = the stack of the thread with tid 1001 [vdso] = the "virtual dynamic shared object", the kernel system call handler or if empty, the mapping is anonymous. +The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint +of the individual tasks of a process. In this file you will see a mapping marked +as [stack] if that task sees it as a stack. This is a key difference from the +content of /proc/PID/maps, where you will see all mappings that are being used +as stack by all of those tasks. Hence, for the example above, the task-level +map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this: + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. For each of mappings there diff --git a/fs/proc/base.c b/fs/proc/base.c index 965d4bd..3b42c14 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2989,9 +2989,9 @@ static const struct pid_entry tgid_base_stuff[] = { INF("cmdline", S_IRUGO, proc_pid_cmdline), ONE("stat", S_IRUGO, proc_tgid_stat), ONE("statm", S_IRUGO, proc_pid_statm), - REG("maps", S_IRUGO, proc_maps_operations), + REG("maps", S_IRUGO, proc_pid_maps_operations), #ifdef CONFIG_NUMA - REG("numa_maps", S_IRUGO, proc_numa_maps_operations), + REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), LNK("cwd", proc_cwd_link), @@ -3002,7 +3002,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("mountstats", S_IRUSR, proc_mountstats_operations), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, proc_clear_refs_operations), - REG("smaps", S_IRUGO, proc_smaps_operations), + REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), #endif #ifdef CONFIG_SECURITY @@ -3348,9 +3348,9 @@ static const struct pid_entry tid_base_stuff[] = { INF("cmdline", S_IRUGO, proc_pid_cmdline), ONE("stat", S_IRUGO, proc_tid_stat), ONE("statm", S_IRUGO, proc_pid_statm), - REG("maps", S_IRUGO, proc_maps_operations), + REG("maps", S_IRUGO, proc_tid_maps_operations), #ifdef CONFIG_NUMA - REG("numa_maps", S_IRUGO, proc_numa_maps_operations), + REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations), #endif REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations), LNK("cwd", proc_cwd_link), @@ -3360,7 +3360,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("mountinfo", S_IRUGO, proc_mountinfo_operations), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, proc_clear_refs_operations), - REG("smaps", S_IRUGO, proc_smaps_operations), + REG("smaps", S_IRUGO, proc_tid_smaps_operations), REG("pagemap", S_IRUGO, proc_pagemap_operations), #endif #ifdef CONFIG_SECURITY diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 2925775..c44efe1 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -53,9 +53,12 @@ extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task); extern loff_t mem_lseek(struct file *file, loff_t offset, int orig); -extern const struct file_operations proc_maps_operations; -extern const struct file_operations proc_numa_maps_operations; -extern const struct file_operations proc_smaps_operations; +extern const struct file_operations proc_pid_maps_operations; +extern const struct file_operations proc_tid_maps_operations; +extern const struct file_operations proc_pid_numa_maps_operations; +extern const struct file_operations proc_tid_numa_maps_operations; +extern const struct file_operations proc_pid_smaps_operations; +extern const struct file_operations proc_tid_smaps_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; extern const struct file_operations proc_net_operations; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index c7e3a16..9694cc2 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -209,16 +209,20 @@ static int do_maps_open(struct inode *inode, struct file *file, return ret; } -static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) +static void +show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) { struct mm_struct *mm = vma->vm_mm; struct file *file = vma->vm_file; + struct proc_maps_private *priv = m->private; + struct task_struct *task = priv->task; vm_flags_t flags = vma->vm_flags; unsigned long ino = 0; unsigned long long pgoff = 0; unsigned long start, end; dev_t dev = 0; int len; + const char *name = NULL; if (file) { struct inode *inode = vma->vm_file->f_path.dentry->d_inode; @@ -252,36 +256,57 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) if (file) { pad_len_spaces(m, len); seq_path(m, &file->f_path, "\n"); - } else { - const char *name = arch_vma_name(vma); - if (!name) { - if (mm) { - if (vma->vm_start <= mm->brk && - vma->vm_end >= mm->start_brk) { - name = "[heap]"; - } else if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { - name = "[stack]"; - } + goto done; + } + + name = arch_vma_name(vma); + if (!name) { + pid_t tid; + + if (!mm) { + name = "[vdso]"; + goto done; + } + + if (vma->vm_start <= mm->brk && + vma->vm_end >= mm->start_brk) { + name = "[heap]"; + goto done; + } + + tid = vm_is_stack(task, vma, is_pid); + + if (tid != 0) { + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. + */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) { + name = "[stack]"; } else { - name = "[vdso]"; + /* Thread stack in /proc/PID/maps */ + pad_len_spaces(m, len); + seq_printf(m, "[stack:%d]", tid); } } - if (name) { - pad_len_spaces(m, len); - seq_puts(m, name); - } + } + +done: + if (name) { + pad_len_spaces(m, len); + seq_puts(m, name); } seq_putc(m, '\n'); } -static int show_map(struct seq_file *m, void *v) +static int show_map(struct seq_file *m, void *v, int is_pid) { struct vm_area_struct *vma = v; struct proc_maps_private *priv = m->private; struct task_struct *task = priv->task; - show_map_vma(m, vma); + show_map_vma(m, vma, is_pid); if (m->count < m->size) /* vma is copied successfully */ m->version = (vma != get_gate_vma(task->mm)) @@ -289,20 +314,49 @@ static int show_map(struct seq_file *m, void *v) return 0; } +static int show_pid_map(struct seq_file *m, void *v) +{ + return show_map(m, v, 1); +} + +static int show_tid_map(struct seq_file *m, void *v) +{ + return show_map(m, v, 0); +} + static const struct seq_operations proc_pid_maps_op = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_map + .show = show_pid_map +}; + +static const struct seq_operations proc_tid_maps_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_map }; -static int maps_open(struct inode *inode, struct file *file) +static int pid_maps_open(struct inode *inode, struct file *file) { return do_maps_open(inode, file, &proc_pid_maps_op); } -const struct file_operations proc_maps_operations = { - .open = maps_open, +static int tid_maps_open(struct inode *inode, struct file *file) +{ + return do_maps_open(inode, file, &proc_tid_maps_op); +} + +const struct file_operations proc_pid_maps_operations = { + .open = pid_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_maps_operations = { + .open = tid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, @@ -416,7 +470,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } -static int show_smap(struct seq_file *m, void *v) +static int show_smap(struct seq_file *m, void *v, int is_pid) { struct proc_maps_private *priv = m->private; struct task_struct *task = priv->task; @@ -434,7 +488,7 @@ static int show_smap(struct seq_file *m, void *v) if (vma->vm_mm && !is_vm_hugetlb_page(vma)) walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); - show_map_vma(m, vma); + show_map_vma(m, vma, is_pid); seq_printf(m, "Size: %8lu kB\n" @@ -473,20 +527,49 @@ static int show_smap(struct seq_file *m, void *v) return 0; } +static int show_pid_smap(struct seq_file *m, void *v) +{ + return show_smap(m, v, 1); +} + +static int show_tid_smap(struct seq_file *m, void *v) +{ + return show_smap(m, v, 0); +} + static const struct seq_operations proc_pid_smaps_op = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_smap + .show = show_pid_smap +}; + +static const struct seq_operations proc_tid_smaps_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_smap }; -static int smaps_open(struct inode *inode, struct file *file) +static int pid_smaps_open(struct inode *inode, struct file *file) { return do_maps_open(inode, file, &proc_pid_smaps_op); } -const struct file_operations proc_smaps_operations = { - .open = smaps_open, +static int tid_smaps_open(struct inode *inode, struct file *file) +{ + return do_maps_open(inode, file, &proc_tid_smaps_op); +} + +const struct file_operations proc_pid_smaps_operations = { + .open = pid_smaps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_smaps_operations = { + .open = tid_smaps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, @@ -1039,7 +1122,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, /* * Display pages allocated per node and memory policy via /proc. */ -static int show_numa_map(struct seq_file *m, void *v) +static int show_numa_map(struct seq_file *m, void *v, int is_pid) { struct numa_maps_private *numa_priv = m->private; struct proc_maps_private *proc_priv = &numa_priv->proc_maps; @@ -1076,9 +1159,19 @@ static int show_numa_map(struct seq_file *m, void *v) seq_path(m, &file->f_path, "\n\t= "); } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { seq_printf(m, " heap"); - } else if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { - seq_printf(m, " stack"); + } else { + pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); + if (tid != 0) { + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. + */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) + seq_printf(m, " stack"); + else + seq_printf(m, " stack:%d", tid); + } } if (is_vm_hugetlb_page(vma)) @@ -1121,21 +1214,39 @@ out: return 0; } +static int show_pid_numa_map(struct seq_file *m, void *v) +{ + return show_numa_map(m, v, 1); +} + +static int show_tid_numa_map(struct seq_file *m, void *v) +{ + return show_numa_map(m, v, 0); +} + static const struct seq_operations proc_pid_numa_maps_op = { - .start = m_start, - .next = m_next, - .stop = m_stop, - .show = show_numa_map, + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_pid_numa_map, }; -static int numa_maps_open(struct inode *inode, struct file *file) +static const struct seq_operations proc_tid_numa_maps_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_numa_map, +}; + +static int numa_maps_open(struct inode *inode, struct file *file, + const struct seq_operations *ops) { struct numa_maps_private *priv; int ret = -ENOMEM; priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (priv) { priv->proc_maps.pid = proc_pid(inode); - ret = seq_open(file, &proc_pid_numa_maps_op); + ret = seq_open(file, ops); if (!ret) { struct seq_file *m = file->private_data; m->private = priv; @@ -1146,8 +1257,25 @@ static int numa_maps_open(struct inode *inode, struct file *file) return ret; } -const struct file_operations proc_numa_maps_operations = { - .open = numa_maps_open, +static int pid_numa_maps_open(struct inode *inode, struct file *file) +{ + return numa_maps_open(inode, file, &proc_pid_numa_maps_op); +} + +static int tid_numa_maps_open(struct inode *inode, struct file *file) +{ + return numa_maps_open(inode, file, &proc_tid_numa_maps_op); +} + +const struct file_operations proc_pid_numa_maps_operations = { + .open = pid_numa_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_numa_maps_operations = { + .open = tid_numa_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c index 980de54..74fe164 100644 --- a/fs/proc/task_nommu.c +++ b/fs/proc/task_nommu.c @@ -134,9 +134,11 @@ static void pad_len_spaces(struct seq_file *m, int len) /* * display a single VMA to a sequenced file */ -static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) +static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma, + int is_pid) { struct mm_struct *mm = vma->vm_mm; + struct proc_maps_private *priv = m->private; unsigned long ino = 0; struct file *file; dev_t dev = 0; @@ -168,10 +170,19 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) pad_len_spaces(m, len); seq_path(m, &file->f_path, ""); } else if (mm) { - if (vma->vm_start <= mm->start_stack && - vma->vm_end >= mm->start_stack) { + pid_t tid = vm_is_stack(priv->task, vma, is_pid); + + if (tid != 0) { pad_len_spaces(m, len); - seq_puts(m, "[stack]"); + /* + * Thread stack in /proc/PID/task/TID/maps or + * the main process stack. + */ + if (!is_pid || (vma->vm_start <= mm->start_stack && + vma->vm_end >= mm->start_stack)) + seq_printf(m, "[stack]"); + else + seq_printf(m, "[stack:%d]", tid); } } @@ -182,11 +193,22 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) /* * display mapping lines for a particular process's /proc/pid/maps */ -static int show_map(struct seq_file *m, void *_p) +static int show_map(struct seq_file *m, void *_p, int is_pid) { struct rb_node *p = _p; - return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb)); + return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb), + is_pid); +} + +static int show_pid_map(struct seq_file *m, void *_p) +{ + return show_map(m, _p, 1); +} + +static int show_tid_map(struct seq_file *m, void *_p) +{ + return show_map(m, _p, 0); } static void *m_start(struct seq_file *m, loff_t *pos) @@ -240,10 +262,18 @@ static const struct seq_operations proc_pid_maps_ops = { .start = m_start, .next = m_next, .stop = m_stop, - .show = show_map + .show = show_pid_map +}; + +static const struct seq_operations proc_tid_maps_ops = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_tid_map }; -static int maps_open(struct inode *inode, struct file *file) +static int maps_open(struct inode *inode, struct file *file, + const struct seq_operations *ops) { struct proc_maps_private *priv; int ret = -ENOMEM; @@ -251,7 +281,7 @@ static int maps_open(struct inode *inode, struct file *file) priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (priv) { priv->pid = proc_pid(inode); - ret = seq_open(file, &proc_pid_maps_ops); + ret = seq_open(file, ops); if (!ret) { struct seq_file *m = file->private_data; m->private = priv; @@ -262,8 +292,25 @@ static int maps_open(struct inode *inode, struct file *file) return ret; } -const struct file_operations proc_maps_operations = { - .open = maps_open, +static int pid_maps_open(struct inode *inode, struct file *file) +{ + return maps_open(inode, file, &proc_pid_maps_ops); +} + +static int tid_maps_open(struct inode *inode, struct file *file) +{ + return maps_open(inode, file, &proc_tid_maps_ops); +} + +const struct file_operations proc_pid_maps_operations = { + .open = pid_maps_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release_private, +}; + +const struct file_operations proc_tid_maps_operations = { + .open = tid_maps_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, diff --git a/include/linux/mm.h b/include/linux/mm.h index 378bcce..df17ff2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1040,6 +1040,9 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma, !vma_growsup(vma->vm_next, addr); } +extern pid_t +vm_is_stack(struct task_struct *task, struct vm_area_struct *vma, int in_group); + extern unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len); diff --git a/mm/util.c b/mm/util.c index 136ac4f..ae962b3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -239,6 +239,47 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, next->vm_prev = vma; } +/* Check if the vma is being used as a stack by this task */ +static int vm_is_stack_for_task(struct task_struct *t, + struct vm_area_struct *vma) +{ + return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t)); +} + +/* + * Check if the vma is being used as a stack. + * If is_group is non-zero, check in the entire thread group or else + * just check in the current task. Returns the pid of the task that + * the vma is stack for. + */ +pid_t vm_is_stack(struct task_struct *task, + struct vm_area_struct *vma, int in_group) +{ + pid_t ret = 0; + + if (vm_is_stack_for_task(task, vma)) + return task->pid; + + if (in_group) { + struct task_struct *t; + rcu_read_lock(); + if (!pid_alive(task)) + goto done; + + t = task; + do { + if (vm_is_stack_for_task(t, vma)) { + ret = t->pid; + goto done; + } + } while_each_thread(task, t); +done: + rcu_read_unlock(); + } + + return ret; +} + #if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT) void arch_pick_mmap_layout(struct mm_struct *mm) { -- cgit v0.10.2 From 08ab9b10d43aca091fdff58b69fc1ec89c5b8a83 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:34:04 -0700 Subject: mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes Acked-by: KOSAKI Motohiro Acked-by: Pekka Enberg Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index ecb8e22..136e86f 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -346,7 +346,7 @@ static struct sysrq_key_op sysrq_term_op = { static void moom_callback(struct work_struct *ignored) { - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL); + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL, true); } static DECLARE_WORK(moom_work, moom_callback); diff --git a/include/linux/oom.h b/include/linux/oom.h index 552fba9..3d76475 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -49,7 +49,7 @@ extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, - int order, nodemask_t *mask); + int order, nodemask_t *mask, bool force_kill); extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 517299c..f23f334 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -310,7 +310,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, */ static struct task_struct *select_bad_process(unsigned int *ppoints, unsigned long totalpages, struct mem_cgroup *memcg, - const nodemask_t *nodemask) + const nodemask_t *nodemask, bool force_kill) { struct task_struct *g, *p; struct task_struct *chosen = NULL; @@ -336,7 +336,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (test_tsk_thread_flag(p, TIF_MEMDIE)) { if (unlikely(frozen(p))) __thaw_task(p); - return ERR_PTR(-1UL); + if (!force_kill) + return ERR_PTR(-1UL); } if (!p->mm) continue; @@ -354,7 +355,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (p == current) { chosen = p; *ppoints = 1000; - } else { + } else if (!force_kill) { /* * If this task is not being ptraced on exit, * then wait for it to finish before killing @@ -572,7 +573,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask) check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL); limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT; read_lock(&tasklist_lock); - p = select_bad_process(&points, limit, memcg, NULL); + p = select_bad_process(&points, limit, memcg, NULL, false); if (p && PTR_ERR(p) != -1UL) oom_kill_process(p, gfp_mask, 0, points, limit, memcg, NULL, "Memory cgroup out of memory"); @@ -687,6 +688,7 @@ static void clear_system_oom(void) * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 * @nodemask: nodemask passed to page allocator + * @force_kill: true if a task must be killed, even if others are exiting * * If we run out of memory, we have the choice between either * killing a random task (bad), letting the system crash (worse) @@ -694,7 +696,7 @@ static void clear_system_oom(void) * don't have to be perfect here, we just have to be good. */ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, - int order, nodemask_t *nodemask) + int order, nodemask_t *nodemask, bool force_kill) { const nodemask_t *mpol_mask; struct task_struct *p; @@ -738,7 +740,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, goto out; } - p = select_bad_process(&points, totalpages, NULL, mpol_mask); + p = select_bad_process(&points, totalpages, NULL, mpol_mask, + force_kill); /* Found nothing?!?! Either we hang forever, or we panic. */ if (!p) { dump_header(NULL, gfp_mask, order, NULL, mpol_mask); @@ -770,7 +773,7 @@ out: void pagefault_out_of_memory(void) { if (try_set_system_oom()) { - out_of_memory(NULL, 0, 0, NULL); + out_of_memory(NULL, 0, 0, NULL, false); clear_system_oom(); } if (!test_thread_flag(TIF_MEMDIE)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 572b93e..98552cf 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1968,7 +1968,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, goto out; } /* Exhausted what can be done so it's blamo time */ - out_of_memory(zonelist, gfp_mask, order, nodemask); + out_of_memory(zonelist, gfp_mask, order, nodemask, false); out: clear_zonelist_oom(zonelist, gfp_mask); -- cgit v0.10.2 From 6d9d88d07e132259c35f9493b15429e19198489c Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Wed, 21 Mar 2012 16:34:05 -0700 Subject: tmpfs: security xattr setting on inode creation Adds to generic xattr support introduced in Linux 3.0 by implementing initxattrs callback. This enables consulting of security attributes from LSM and EVM when inode is created. [hughd@google.com: moved under CONFIG_TMPFS_XATTR, with memcpy in shmem_xattr_alloc] Signed-off-by: Jarkko Sakkinen Reviewed-by: James Morris Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/shmem.c b/mm/shmem.c index b7e1955..7cc8083 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1178,6 +1178,12 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode static const struct inode_operations shmem_symlink_inode_operations; static const struct inode_operations shmem_short_symlink_operations; +#ifdef CONFIG_TMPFS_XATTR +static int shmem_initxattrs(struct inode *, const struct xattr *, void *); +#else +#define shmem_initxattrs NULL +#endif + static int shmem_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, @@ -1490,7 +1496,7 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) if (inode) { error = security_inode_init_security(inode, dir, &dentry->d_name, - NULL, NULL); + shmem_initxattrs, NULL); if (error) { if (error != -EOPNOTSUPP) { iput(inode); @@ -1630,7 +1636,7 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s return -ENOSPC; error = security_inode_init_security(inode, dir, &dentry->d_name, - NULL, NULL); + shmem_initxattrs, NULL); if (error) { if (error != -EOPNOTSUPP) { iput(inode); @@ -1704,6 +1710,66 @@ static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *co * filesystem level, though. */ +/* + * Allocate new xattr and copy in the value; but leave the name to callers. + */ +static struct shmem_xattr *shmem_xattr_alloc(const void *value, size_t size) +{ + struct shmem_xattr *new_xattr; + size_t len; + + /* wrap around? */ + len = sizeof(*new_xattr) + size; + if (len <= sizeof(*new_xattr)) + return NULL; + + new_xattr = kmalloc(len, GFP_KERNEL); + if (!new_xattr) + return NULL; + + new_xattr->size = size; + memcpy(new_xattr->value, value, size); + return new_xattr; +} + +/* + * Callback for security_inode_init_security() for acquiring xattrs. + */ +static int shmem_initxattrs(struct inode *inode, + const struct xattr *xattr_array, + void *fs_info) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + const struct xattr *xattr; + struct shmem_xattr *new_xattr; + size_t len; + + for (xattr = xattr_array; xattr->name != NULL; xattr++) { + new_xattr = shmem_xattr_alloc(xattr->value, xattr->value_len); + if (!new_xattr) + return -ENOMEM; + + len = strlen(xattr->name) + 1; + new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, + GFP_KERNEL); + if (!new_xattr->name) { + kfree(new_xattr); + return -ENOMEM; + } + + memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, + XATTR_SECURITY_PREFIX_LEN); + memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, + xattr->name, len); + + spin_lock(&info->lock); + list_add(&new_xattr->list, &info->xattr_list); + spin_unlock(&info->lock); + } + + return 0; +} + static int shmem_xattr_get(struct dentry *dentry, const char *name, void *buffer, size_t size) { @@ -1731,24 +1797,17 @@ static int shmem_xattr_get(struct dentry *dentry, const char *name, return ret; } -static int shmem_xattr_set(struct dentry *dentry, const char *name, +static int shmem_xattr_set(struct inode *inode, const char *name, const void *value, size_t size, int flags) { - struct inode *inode = dentry->d_inode; struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_xattr *xattr; struct shmem_xattr *new_xattr = NULL; - size_t len; int err = 0; /* value == NULL means remove */ if (value) { - /* wrap around? */ - len = sizeof(*new_xattr) + size; - if (len <= sizeof(*new_xattr)) - return -ENOMEM; - - new_xattr = kmalloc(len, GFP_KERNEL); + new_xattr = shmem_xattr_alloc(value, size); if (!new_xattr) return -ENOMEM; @@ -1757,9 +1816,6 @@ static int shmem_xattr_set(struct dentry *dentry, const char *name, kfree(new_xattr); return -ENOMEM; } - - new_xattr->size = size; - memcpy(new_xattr->value, value, size); } spin_lock(&info->lock); @@ -1858,7 +1914,7 @@ static int shmem_setxattr(struct dentry *dentry, const char *name, if (size == 0) value = ""; /* empty EA, do not remove */ - return shmem_xattr_set(dentry, name, value, size, flags); + return shmem_xattr_set(dentry->d_inode, name, value, size, flags); } @@ -1878,7 +1934,7 @@ static int shmem_removexattr(struct dentry *dentry, const char *name) if (err) return err; - return shmem_xattr_set(dentry, name, NULL, 0, XATTR_REPLACE); + return shmem_xattr_set(dentry->d_inode, name, NULL, 0, XATTR_REPLACE); } static bool xattr_is_trusted(const char *name) -- cgit v0.10.2 From 385de35722c9a22917e7bc5e63cd83a8cffa5ecd Mon Sep 17 00:00:00 2001 From: Dean Nelson Date: Wed, 21 Mar 2012 16:34:05 -0700 Subject: thp: allow a hwpoisoned head page to be put back to LRU Andrea Arcangeli pointed out to me that a check in __memory_failure() which was intended to prevent THP tail pages from being checked for the absence of the PG_lru flag (something that is always the case), was also preventing THP head pages from being checked. A THP head page could actually benefit from the call to shake_page() by ending up being put back to a LRU, provided it had been waiting in a pagevec array. Andrea suggested that the "!PageTransCompound(p)" in the if-statement should be replaced by a "!PageTransTail(p)", thus allowing THP head pages to be checked and possibly shaken. Signed-off-by: Dean Nelson Cc: Jin Dongming Reviewed-by: Andrea Arcangeli Cc: Andi Kleen Cc: Hidetoshi Seto Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e90a673..6b25758 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -414,11 +414,26 @@ static inline int PageTransHuge(struct page *page) return PageHead(page); } +/* + * PageTransCompound returns true for both transparent huge pages + * and hugetlbfs pages, so it should only be called when it's known + * that hugetlbfs pages aren't involved. + */ static inline int PageTransCompound(struct page *page) { return PageCompound(page); } +/* + * PageTransTail returns true for both transparent huge pages + * and hugetlbfs pages, so it should only be called when it's known + * that hugetlbfs pages aren't involved. + */ +static inline int PageTransTail(struct page *page) +{ + return PageTail(page); +} + #else static inline int PageTransHuge(struct page *page) @@ -430,6 +445,11 @@ static inline int PageTransCompound(struct page *page) { return 0; } + +static inline int PageTransTail(struct page *page) +{ + return 0; +} #endif #ifdef CONFIG_MMU diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 56080ea..c22076ff 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1063,7 +1063,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags) * The check (unnecessarily) ignores LRU pages being isolated and * walked by the page reclaim code, however that's not a big loss. */ - if (!PageHuge(p) && !PageTransCompound(p)) { + if (!PageHuge(p) && !PageTransTail(p)) { if (!PageLRU(p)) shake_page(p, 0); if (!PageLRU(p)) { -- cgit v0.10.2 From 3268c63eded4612a3d07b56d1e02ce7731e6608e Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Wed, 21 Mar 2012 16:34:06 -0700 Subject: mm: fix move/migrate_pages() race on task struct Migration functions perform the rcu_read_unlock too early. As a result the task pointed to may change from under us. This can result in an oops, as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302. The following patch extend the period of the rcu_read_lock until after the permissions checks are done. We also take a refcount so that the task reference is stable when calling security check functions and performing cpuset node validation (which takes a mutex). The refcount is dropped before actual page migration occurs so there is no change to the refcounts held during page migration. Also move the determination of the mm of the task struct to immediately before the do_migrate*() calls so that it is clear that we switch from handling the task during permission checks to the mm for the actual migration. Since the determination is only done once and we then no longer use the task_struct we can be sure that we operate on a specific address space that will not change from under us. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Christoph Lameter Cc: "Eric W. Biederman" Reported-by: Dave Hansen Cc: Mel Gorman Cc: Johannes Weiner Cc: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0a37570..71e1a52 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1323,12 +1323,9 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, err = -ESRCH; goto out; } - mm = get_task_mm(task); - rcu_read_unlock(); + get_task_struct(task); err = -EINVAL; - if (!mm) - goto out; /* * Check if this process has the right to modify the specified @@ -1336,14 +1333,13 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, * capabilities, superuser privileges or the same * userid as the target process. */ - rcu_read_lock(); tcred = __task_cred(task); if (cred->euid != tcred->suid && cred->euid != tcred->uid && cred->uid != tcred->suid && cred->uid != tcred->uid && !capable(CAP_SYS_NICE)) { rcu_read_unlock(); err = -EPERM; - goto out; + goto out_put; } rcu_read_unlock(); @@ -1351,26 +1347,36 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, /* Is the user allowed to access the target nodes? */ if (!nodes_subset(*new, task_nodes) && !capable(CAP_SYS_NICE)) { err = -EPERM; - goto out; + goto out_put; } if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) { err = -EINVAL; - goto out; + goto out_put; } err = security_task_movememory(task); if (err) - goto out; + goto out_put; - err = do_migrate_pages(mm, old, new, - capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE); -out: + mm = get_task_mm(task); + put_task_struct(task); if (mm) - mmput(mm); + err = do_migrate_pages(mm, old, new, + capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE); + else + err = -EINVAL; + + mmput(mm); +out: NODEMASK_SCRATCH_FREE(scratch); return err; + +out_put: + put_task_struct(task); + goto out; + } diff --git a/mm/migrate.c b/mm/migrate.c index 1503b6b..51c08a0 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1174,20 +1174,17 @@ set_status: * Migrate an array of page address onto an array of nodes and fill * the corresponding array of status. */ -static int do_pages_move(struct mm_struct *mm, struct task_struct *task, +static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, unsigned long nr_pages, const void __user * __user *pages, const int __user *nodes, int __user *status, int flags) { struct page_to_node *pm; - nodemask_t task_nodes; unsigned long chunk_nr_pages; unsigned long chunk_start; int err; - task_nodes = cpuset_mems_allowed(task); - err = -ENOMEM; pm = (struct page_to_node *)__get_free_page(GFP_KERNEL); if (!pm) @@ -1349,6 +1346,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, struct task_struct *task; struct mm_struct *mm; int err; + nodemask_t task_nodes; /* Check flags */ if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL)) @@ -1364,11 +1362,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, rcu_read_unlock(); return -ESRCH; } - mm = get_task_mm(task); - rcu_read_unlock(); - - if (!mm) - return -EINVAL; + get_task_struct(task); /* * Check if this process has the right to modify the specified @@ -1376,7 +1370,6 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, * capabilities, superuser privileges or the same * userid as the target process. */ - rcu_read_lock(); tcred = __task_cred(task); if (cred->euid != tcred->suid && cred->euid != tcred->uid && cred->uid != tcred->suid && cred->uid != tcred->uid && @@ -1391,16 +1384,25 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages, if (err) goto out; - if (nodes) { - err = do_pages_move(mm, task, nr_pages, pages, nodes, status, - flags); - } else { - err = do_pages_stat(mm, nr_pages, pages, status); - } + task_nodes = cpuset_mems_allowed(task); + mm = get_task_mm(task); + put_task_struct(task); + + if (mm) { + if (nodes) + err = do_pages_move(mm, task_nodes, nr_pages, pages, + nodes, status, flags); + else + err = do_pages_stat(mm, nr_pages, pages, status); + } else + err = -EINVAL; -out: mmput(mm); return err; + +out: + put_task_struct(task); + return err; } /* -- cgit v0.10.2 From f0cb3c76ae1ced85f9034480b1b24cd96530ec78 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:34:06 -0700 Subject: mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug This cpu hotplug hook was accidentally removed in commit 00a62ce91e55 ("mm: fix Committed_AS underflow on large NR_CPUS environment") The visible effect of this accident: some pages are borrowed in per-cpu page-vectors. Truncate can deal with it, but these pages cannot be reused while this cpu is offline. So this is like a temporary memory leak. Signed-off-by: Konstantin Khlebnikov Cc: Dave Hansen Cc: KOSAKI Motohiro Cc: Eric B Munson Cc: Mel Gorman Cc: Christoph Lameter Cc: KAMEZAWA Hiroyuki Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/swap.h b/include/linux/swap.h index 64a7dba..b86b5c2 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -223,6 +223,7 @@ extern void lru_add_page_tail(struct zone* zone, extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); +extern void lru_add_drain_cpu(int cpu); extern int lru_add_drain_all(void); extern void rotate_reclaimable_page(struct page *page); extern void deactivate_page(struct page *page); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 98552cf..673596a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4825,6 +4825,7 @@ static int page_alloc_cpu_notify(struct notifier_block *self, int cpu = (unsigned long)hcpu; if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { + lru_add_drain_cpu(cpu); drain_pages(cpu); /* diff --git a/mm/swap.c b/mm/swap.c index 14380e9..5c13f13 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -496,7 +496,7 @@ static void lru_deactivate_fn(struct page *page, void *arg) * Either "cpu" is the current CPU, and preemption has already been * disabled; or "cpu" is being hot-unplugged, and is already dead. */ -static void drain_cpu_pagevecs(int cpu) +void lru_add_drain_cpu(int cpu) { struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu); struct pagevec *pvec; @@ -553,7 +553,7 @@ void deactivate_page(struct page *page) void lru_add_drain(void) { - drain_cpu_pagevecs(get_cpu()); + lru_add_drain_cpu(get_cpu()); put_cpu(); } -- cgit v0.10.2 From f5bf18fa22f8c41a13eb8762c7373eb3a93a7333 Mon Sep 17 00:00:00 2001 From: Nishanth Aravamudan Date: Wed, 21 Mar 2012 16:34:07 -0700 Subject: bootmem/sparsemem: remove limit constraint in alloc_bootmem_section While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory Overcommit) on powerpc, we tripped the following: kernel BUG at mm/bootmem.c:483! cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940] pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c sp: c000000000c03bc0 msr: 8000000000021032 current = 0xc000000000b0cce0 paca = 0xc000000001d80000 pid = 0, comm = swapper kernel BUG at mm/bootmem.c:483! enter ? for help [c000000000c03c80] c000000000a64bcc .sparse_early_usemaps_alloc_node+0x84/0x29c [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294 [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460 [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c This is BUG_ON(limit && goal + size > limit); and after some debugging, it seems that goal = 0x7ffff000000 limit = 0x80000000000 and sparse_early_usemaps_alloc_node -> sparse_early_usemaps_alloc_pgdat_section calls return alloc_bootmem_section(usemap_size() * count, section_nr); This is on a system with 8TB available via the AMS pool, and as a quirk of AMS in firmware, all of that memory shows up in node 0. So, we end up with an allocation that will fail the goal/limit constraints. In theory, we could "fall-back" to alloc_bootmem_node() in sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE defined, we'll BUG_ON() instead. A simple solution appears to be to unconditionally remove the limit condition in alloc_bootmem_section, meaning allocations are allowed to cross section boundaries (necessary for systems of this size). Johannes Weiner pointed out that if alloc_bootmem_section() no longer guarantees section-locality, we need check_usemap_section_nr() to print possible cross-dependencies between node descriptors and the usemaps allocated through it. That makes the two loops in sparse_early_usemaps_alloc_node() identical, so re-factor the code a bit. [akpm@linux-foundation.org: code simplification] Signed-off-by: Nishanth Aravamudan Cc: Dave Hansen Cc: Anton Blanchard Cc: Paul Mackerras Cc: Ben Herrenschmidt Cc: Robert Jennings Acked-by: Johannes Weiner Acked-by: Mel Gorman Cc: [3.3.1] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/bootmem.c b/mm/bootmem.c index 668e94d..0131170 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -766,14 +766,13 @@ void * __init alloc_bootmem_section(unsigned long size, unsigned long section_nr) { bootmem_data_t *bdata; - unsigned long pfn, goal, limit; + unsigned long pfn, goal; pfn = section_nr_to_pfn(section_nr); goal = pfn << PAGE_SHIFT; - limit = section_nr_to_pfn(section_nr + 1) << PAGE_SHIFT; bdata = &bootmem_node_data[early_pfn_to_nid(pfn)]; - return alloc_bootmem_core(bdata, size, SMP_CACHE_BYTES, goal, limit); + return alloc_bootmem_core(bdata, size, SMP_CACHE_BYTES, goal, 0); } #endif diff --git a/mm/sparse.c b/mm/sparse.c index 61d7cde..a8bc7d3 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -353,29 +353,21 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long**usemap_map, usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid), usemap_count); - if (usemap) { - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { - if (!present_section_nr(pnum)) - continue; - usemap_map[pnum] = usemap; - usemap += size; + if (!usemap) { + usemap = alloc_bootmem_node(NODE_DATA(nodeid), size * usemap_count); + if (!usemap) { + printk(KERN_WARNING "%s: allocation failed\n", __func__); + return; } - return; } - usemap = alloc_bootmem_node(NODE_DATA(nodeid), size * usemap_count); - if (usemap) { - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { - if (!present_section_nr(pnum)) - continue; - usemap_map[pnum] = usemap; - usemap += size; - check_usemap_section_nr(nodeid, usemap_map[pnum]); - } - return; + for (pnum = pnum_begin; pnum < pnum_end; pnum++) { + if (!present_section_nr(pnum)) + continue; + usemap_map[pnum] = usemap; + usemap += size; + check_usemap_section_nr(nodeid, usemap_map[pnum]); } - - printk(KERN_WARNING "%s: allocation failed\n", __func__); } #ifndef CONFIG_SPARSEMEM_VMEMMAP -- cgit v0.10.2 From a05b0855fd15504972dba2358e5faa172a1e50ba Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Wed, 21 Mar 2012 16:34:08 -0700 Subject: hugetlbfs: avoid taking i_mutex from hugetlbfs_read() Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as explained below Thread A: read() on hugetlbfs hugetlbfs_read() called i_mutex grabbed hugetlbfs_read_actor() called __copy_to_user() called page fault is triggered Thread B, sharing address space with A: mmap() the same file ->mmap_sem is grabbed on task_B->mm->mmap_sem hugetlbfs_file_mmap() is called attempt to grab ->i_mutex and block waiting for A to give it up Thread A: pagefault handled blocked on attempt to grab task_A->mm->mmap_sem, which happens to be the same thing as task_B->mm->mmap_sem. Block waiting for B to give it up. AFAIU the i_mutex locking was added to hugetlbfs_read() as per http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take care of the race between truncate and read. This patch fixes this by looking at page->mapping under lock_page() (find_lock_page()) to ensure that the inode didn't get truncated in the range during a parallel read. Ideally we can extend the patch to make sure we don't increase i_size in mmap. But that will break userspace, because applications will now have to use truncate(2) to increase i_size in hugetlbfs. Based on the original patch from Hillf Danton. Signed-off-by: Aneesh Kumar K.V Cc: Hillf Danton Cc: KAMEZAWA Hiroyuki Cc: Al Viro Cc: Hugh Dickins Cc: [everything after 2007 :)] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index b7bc786..19654cf 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -245,17 +245,10 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf, loff_t isize; ssize_t retval = 0; - mutex_lock(&inode->i_mutex); - /* validate length */ if (len == 0) goto out; - isize = i_size_read(inode); - if (!isize) - goto out; - - end_index = (isize - 1) >> huge_page_shift(h); for (;;) { struct page *page; unsigned long nr, ret; @@ -263,18 +256,21 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf, /* nr is the maximum number of bytes to copy from this page */ nr = huge_page_size(h); + isize = i_size_read(inode); + if (!isize) + goto out; + end_index = (isize - 1) >> huge_page_shift(h); if (index >= end_index) { if (index > end_index) goto out; nr = ((isize - 1) & ~huge_page_mask(h)) + 1; - if (nr <= offset) { + if (nr <= offset) goto out; - } } nr = nr - offset; /* Find the page */ - page = find_get_page(mapping, index); + page = find_lock_page(mapping, index); if (unlikely(page == NULL)) { /* * We have a HOLE, zero out the user-buffer for the @@ -286,17 +282,18 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf, else ra = 0; } else { + unlock_page(page); + /* * We have the page, copy it to user space buffer. */ ra = hugetlbfs_read_actor(page, offset, buf, len, nr); ret = ra; + page_cache_release(page); } if (ra < 0) { if (retval == 0) retval = ra; - if (page) - page_cache_release(page); goto out; } @@ -306,16 +303,12 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf, index += offset >> huge_page_shift(h); offset &= ~huge_page_mask(h); - if (page) - page_cache_release(page); - /* short read or no more work */ if ((ret != nr) || (len == 0)) break; } out: *ppos = ((loff_t)index << huge_page_shift(h)) + offset; - mutex_unlock(&inode->i_mutex); return retval; } -- cgit v0.10.2 From 1010bb1b80edb0713415dfe1f97114d320f58c4f Mon Sep 17 00:00:00 2001 From: Fengguang Wu Date: Wed, 21 Mar 2012 16:34:08 -0700 Subject: mm: don't set __GFP_WRITE on ramfs/sysfs writes There is not much point in skipping zones during allocation based on the dirty usage which they'll never contribute to. And we'd like to avoid page reclaim waits when writing to ramfs/sysfs etc. Signed-off-by: Fengguang Wu Acked-by: Johannes Weiner Cc: Jan Kara Cc: Greg Thelen Cc: Ying Han Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Cc: Mel Gorman Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/filemap.c b/mm/filemap.c index 2f81650..e8cf8ae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2341,7 +2341,9 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, struct page *page; gfp_t gfp_notmask = 0; - gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE; + gfp_mask = mapping_gfp_mask(mapping); + if (mapping_cap_account_dirty(mapping)) + gfp_mask |= __GFP_WRITE; if (flags & AOP_FLAG_NOFS) gfp_notmask = __GFP_FS; repeat: -- cgit v0.10.2 From 47a133339c332f9f8e155c70f5da401aded69948 Mon Sep 17 00:00:00 2001 From: Fengguang Wu Date: Wed, 21 Mar 2012 16:34:09 -0700 Subject: mm: use global_dirty_limit in throttle_vm_writeout() When starting a memory hog task, a desktop box w/o swap is found to go unresponsive for a long time. It's solely caused by lots of congestion waits in throttle_vm_writeout(): gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1 gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000 gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1 gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000 Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1 Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000 The root cause is, the dirty threshold is knocked down a lot by the memory hog task. Fixed by using global_dirty_limit which decreases gradually on such events and can guarantee we stay above (the also decreasing) nr_dirty in the progress of following down to the new dirty threshold. Signed-off-by: Fengguang Wu Cc: Johannes Weiner Cc: Jan Kara Cc: Greg Thelen Cc: Ying Han Cc: KAMEZAWA Hiroyuki Reviewed-by: Rik van Riel Cc: Mel Gorman Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 363ba70..3fc2617 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1472,6 +1472,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) for ( ; ; ) { global_dirty_limits(&background_thresh, &dirty_thresh); + dirty_thresh = hard_dirty_limit(dirty_thresh); /* * Boost the allowable dirty threshold a bit for page -- cgit v0.10.2 From 9a3c531df9462df6cc2b060f749651723ffc180c Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Wed, 21 Mar 2012 16:34:09 -0700 Subject: mm: update stale lock ordering comment for memory-failure.c When i_mmap_lock changed to a mutex the locking order in memory failure was changed to take the sleeping lock first. But the big fat mm lock ordering comment (BFMLO) wasn't updated. Do this here. Pointed out by Andrew. Signed-off-by: Andi Kleen Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/filemap.c b/mm/filemap.c index e8cf8ae..f323060 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -101,9 +101,8 @@ * ->inode->i_lock (zap_pte_range->set_page_dirty) * ->private_lock (zap_pte_range->__set_page_dirty_buffers) * - * (code doesn't rely on that order, so you could switch it around) - * ->tasklist_lock (memory_failure, collect_procs_ao) - * ->i_mmap_mutex + * ->i_mmap_mutex + * ->tasklist_lock (memory_failure, collect_procs_ao) */ /* -- cgit v0.10.2 From c7cfa37b7324a190fc36ff116d79d0f899e8d273 Mon Sep 17 00:00:00 2001 From: Copot Alexandru Date: Wed, 21 Mar 2012 16:34:10 -0700 Subject: mm/vmscan.c: fix spelling error s/noticable/noticeable/ Signed-off-by: Copot Alexandru Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 57d8ef6..440af1d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2261,8 +2261,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, * Even though compaction is invoked for any * non-zero order, only frequent costly order * reclamation is disruptive enough to become a - * noticable problem, like transparent huge page - * allocations. + * noticeable problem, like transparent huge + * page allocations. */ if (compaction_ready(zone, sc)) { aborted_reclaim = true; -- cgit v0.10.2 From e845e199362cc5712ba0e7eedc14eed70e144258 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:34:10 -0700 Subject: mm, memcg: pass charge order to oom killer The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes Cc: Johannes Weiner Cc: Michal Hocko Cc: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b80de52..d909650 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -77,7 +77,8 @@ extern void mem_cgroup_uncharge_end(void); extern void mem_cgroup_uncharge_page(struct page *page); extern void mem_cgroup_uncharge_cache_page(struct page *page); -extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask); +extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, + int order); int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg); extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3728181..bb04067 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1811,7 +1811,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) /* * try to call OOM killer. returns false if we should exit memory-reclaim loop. */ -bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) +bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { struct oom_wait_info owait; bool locked, need_to_kill; @@ -1841,7 +1841,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask) if (need_to_kill) { finish_wait(&memcg_oom_waitq, &owait.wait); - mem_cgroup_out_of_memory(memcg, mask); + mem_cgroup_out_of_memory(memcg, mask, order); } else { schedule(); finish_wait(&memcg_oom_waitq, &owait.wait); @@ -2212,7 +2212,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!oom_check) return CHARGE_NOMEM; /* check OOM */ - if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask)) + if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize))) return CHARGE_OOM_DIE; return CHARGE_RETRY; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index f23f334..4198e00 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -554,7 +554,8 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, } #ifdef CONFIG_CGROUP_MEM_RES_CTLR -void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask) +void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, + int order) { unsigned long limit; unsigned int points = 0; @@ -570,12 +571,12 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask) return; } - check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0, NULL); + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT; read_lock(&tasklist_lock); p = select_bad_process(&points, limit, memcg, NULL, false); if (p && PTR_ERR(p) != -1UL) - oom_kill_process(p, gfp_mask, 0, points, limit, memcg, NULL, + oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL, "Memory cgroup out of memory"); read_unlock(&tasklist_lock); } -- cgit v0.10.2 From cc9a6c8776615f9c194ccf0b63a0aa5628235545 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 21 Mar 2012 16:34:11 -0700 Subject: cpuset: mm: reduce large amounts of memory barrier related damage v3 Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman Cc: Miao Xie Cc: David Rientjes Cc: Peter Zijlstra Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index e9eaec5..7a7e5fd 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -89,42 +89,33 @@ extern void rebuild_sched_domains(void); extern void cpuset_print_task_mems_allowed(struct task_struct *p); /* - * reading current mems_allowed and mempolicy in the fastpath must protected - * by get_mems_allowed() + * get_mems_allowed is required when making decisions involving mems_allowed + * such as during page allocation. mems_allowed can be updated in parallel + * and depending on the new value an operation can fail potentially causing + * process failure. A retry loop with get_mems_allowed and put_mems_allowed + * prevents these artificial failures. */ -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { - current->mems_allowed_change_disable++; - - /* - * ensure that reading mems_allowed and mempolicy happens after the - * update of ->mems_allowed_change_disable. - * - * the write-side task finds ->mems_allowed_change_disable is not 0, - * and knows the read-side task is reading mems_allowed or mempolicy, - * so it will clear old bits lazily. - */ - smp_mb(); + return read_seqcount_begin(¤t->mems_allowed_seq); } -static inline void put_mems_allowed(void) +/* + * If this returns false, the operation that took place after get_mems_allowed + * may have failed. It is up to the caller to retry the operation if + * appropriate. + */ +static inline bool put_mems_allowed(unsigned int seq) { - /* - * ensure that reading mems_allowed and mempolicy before reducing - * mems_allowed_change_disable. - * - * the write-side task will know that the read-side task is still - * reading mems_allowed or mempolicy, don't clears old bits in the - * nodemask. - */ - smp_mb(); - --ACCESS_ONCE(current->mems_allowed_change_disable); + return !read_seqcount_retry(¤t->mems_allowed_seq, seq); } static inline void set_mems_allowed(nodemask_t nodemask) { task_lock(current); + write_seqcount_begin(¤t->mems_allowed_seq); current->mems_allowed = nodemask; + write_seqcount_end(¤t->mems_allowed_seq); task_unlock(current); } @@ -234,12 +225,14 @@ static inline void set_mems_allowed(nodemask_t nodemask) { } -static inline void get_mems_allowed(void) +static inline unsigned int get_mems_allowed(void) { + return 0; } -static inline void put_mems_allowed(void) +static inline bool put_mems_allowed(unsigned int seq) { + return true; } #endif /* !CONFIG_CPUSETS */ diff --git a/include/linux/init_task.h b/include/linux/init_task.h index f994d51..e4baff5 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -29,6 +29,13 @@ extern struct fs_struct init_fs; #define INIT_GROUP_RWSEM(sig) #endif +#ifdef CONFIG_CPUSETS +#define INIT_CPUSET_SEQ \ + .mems_allowed_seq = SEQCNT_ZERO, +#else +#define INIT_CPUSET_SEQ +#endif + #define INIT_SIGNALS(sig) { \ .nr_threads = 1, \ .wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\ @@ -192,6 +199,7 @@ extern struct cred init_cred; INIT_FTRACE_GRAPH \ INIT_TRACE_RECURSION \ INIT_TASK_RCU_PREEMPT(tsk) \ + INIT_CPUSET_SEQ \ } diff --git a/include/linux/sched.h b/include/linux/sched.h index e074e1e..0c147a4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1514,7 +1514,7 @@ struct task_struct { #endif #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; /* Protected by alloc_lock */ - int mems_allowed_change_disable; + seqcount_t mems_allowed_seq; /* Seqence no to catch updates */ int cpuset_mem_spread_rotor; int cpuset_slab_spread_rotor; #endif diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 5d57583..1010cc6 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk, { bool need_loop; -repeat: /* * Allow tasks that have access to memory reserves because they have * been OOM killed to get memory anywhere. @@ -983,45 +982,19 @@ repeat: */ need_loop = task_has_mempolicy(tsk) || !nodes_intersects(*newmems, tsk->mems_allowed); - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); - /* - * ensure checking ->mems_allowed_change_disable after setting all new - * allowed nodes. - * - * the read-side task can see an nodemask with new allowed nodes and - * old allowed nodes. and if it allocates page when cpuset clears newly - * disallowed ones continuous, it can see the new allowed bits. - * - * And if setting all new allowed nodes is after the checking, setting - * all new allowed nodes and clearing newly disallowed ones will be done - * continuous, and the read-side task may find no node to alloc page. - */ - smp_mb(); + if (need_loop) + write_seqcount_begin(&tsk->mems_allowed_seq); - /* - * Allocation of memory is very fast, we needn't sleep when waiting - * for the read-side. - */ - while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { - task_unlock(tsk); - if (!task_curr(tsk)) - yield(); - goto repeat; - } - - /* - * ensure checking ->mems_allowed_change_disable before clearing all new - * disallowed nodes. - * - * if clearing newly disallowed bits before the checking, the read-side - * task may find no node to alloc page. - */ - smp_mb(); + nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); tsk->mems_allowed = *newmems; + + if (need_loop) + write_seqcount_end(&tsk->mems_allowed_seq); + task_unlock(tsk); } diff --git a/kernel/fork.c b/kernel/fork.c index a9e99f3..9cc227d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1237,6 +1237,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, #ifdef CONFIG_CPUSETS p->cpuset_mem_spread_rotor = NUMA_NO_NODE; p->cpuset_slab_spread_rotor = NUMA_NO_NODE; + seqcount_init(&p->mems_allowed_seq); #endif #ifdef CONFIG_TRACE_IRQFLAGS p->irq_events = 0; diff --git a/mm/filemap.c b/mm/filemap.c index f323060..8430420 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -499,10 +499,13 @@ struct page *__page_cache_alloc(gfp_t gfp) struct page *page; if (cpuset_do_page_mem_spread()) { - get_mems_allowed(); - n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); - put_mems_allowed(); + unsigned int cpuset_mems_cookie; + do { + cpuset_mems_cookie = get_mems_allowed(); + n = cpuset_mem_spread_node(); + page = alloc_pages_exact_node(n, gfp, 0); + } while (!put_mems_allowed(cpuset_mems_cookie) && !page); + return page; } return alloc_pages(gfp, 0); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 62f9fad..b1c3148 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address, int avoid_reserve) { - struct page *page = NULL; + struct page *page; struct mempolicy *mpol; nodemask_t *nodemask; struct zonelist *zonelist; struct zone *zone; struct zoneref *z; + unsigned int cpuset_mems_cookie; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); zonelist = huge_zonelist(vma, address, htlb_alloc_mask, &mpol, &nodemask); /* @@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, } } } -err: + mpol_cond_put(mpol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; + +err: + mpol_cond_put(mpol); + return NULL; } static void update_and_free_page(struct hstate *h, struct page *page) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 71e1a52..cfb6c86 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1850,18 +1850,24 @@ struct page * alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node) { - struct mempolicy *pol = get_vma_policy(current, vma, addr); + struct mempolicy *pol; struct zonelist *zl; struct page *page; + unsigned int cpuset_mems_cookie; + +retry_cpuset: + pol = get_vma_policy(current, vma, addr); + cpuset_mems_cookie = get_mems_allowed(); - get_mems_allowed(); if (unlikely(pol->mode == MPOL_INTERLEAVE)) { unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } zl = policy_zonelist(gfp, pol, node); @@ -1872,7 +1878,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, struct page *page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); __mpol_put(pol); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } /* @@ -1880,7 +1887,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); - put_mems_allowed(); + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; return page; } @@ -1907,11 +1915,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) { struct mempolicy *pol = current->mempolicy; struct page *page; + unsigned int cpuset_mems_cookie; if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) pol = &default_policy; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* * No reference counting needed for current->mempolicy * nor system default_policy @@ -1922,7 +1933,10 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order) page = __alloc_pages_nodemask(gfp, order, policy_zonelist(gfp, pol, numa_node_id()), policy_nodemask(gfp, pol)); - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(alloc_pages_current); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 673596a..40de685 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2380,8 +2380,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, { enum zone_type high_zoneidx = gfp_zone(gfp_mask); struct zone *preferred_zone; - struct page *page; + struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); + unsigned int cpuset_mems_cookie; gfp_mask &= gfp_allowed_mask; @@ -2400,15 +2401,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; - get_mems_allowed(); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + /* The preferred zone is used for statistics later */ first_zones_zonelist(zonelist, high_zoneidx, nodemask ? : &cpuset_current_mems_allowed, &preferred_zone); - if (!preferred_zone) { - put_mems_allowed(); - return NULL; - } + if (!preferred_zone) + goto out; /* First allocation attempt */ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, @@ -2418,9 +2419,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, page = __alloc_pages_slowpath(gfp_mask, order, zonelist, high_zoneidx, nodemask, preferred_zone, migratetype); - put_mems_allowed(); trace_mm_page_alloc(page, order, gfp_mask, migratetype); + +out: + /* + * When updating a task's mems_allowed, it is possible to race with + * parallel threads in such a way that an allocation can fail while + * the mask is being updated. If a page allocation is about to fail, + * check if the cpuset changed during allocation and if so, retry. + */ + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) + goto retry_cpuset; + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2634,13 +2645,15 @@ void si_meminfo_node(struct sysinfo *val, int nid) bool skip_free_areas_node(unsigned int flags, int nid) { bool ret = false; + unsigned int cpuset_mems_cookie; if (!(flags & SHOW_MEM_FILTER_NODES)) goto out; - get_mems_allowed(); - ret = !node_isset(nid, cpuset_current_mems_allowed); - put_mems_allowed(); + do { + cpuset_mems_cookie = get_mems_allowed(); + ret = !node_isset(nid, cpuset_current_mems_allowed); + } while (!put_mems_allowed(cpuset_mems_cookie)); out: return ret; } diff --git a/mm/slab.c b/mm/slab.c index f0bd785..29c8716 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3284,12 +3284,10 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; nid_alloc = nid_here = numa_mem_id(); - get_mems_allowed(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_slab_spread_node(); else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); - put_mems_allowed(); if (nid_alloc != nid_here) return ____cache_alloc_node(cachep, flags, nid_alloc); return NULL; @@ -3312,14 +3310,17 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) enum zone_type high_zoneidx = gfp_zone(flags); void *obj = NULL; int nid; + unsigned int cpuset_mems_cookie; if (flags & __GFP_THISNODE) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); +retry_cpuset: + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + retry: /* * Look through allowed nodes for objects available @@ -3372,7 +3373,9 @@ retry: } } } - put_mems_allowed(); + + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) + goto retry_cpuset; return obj; } diff --git a/mm/slub.c b/mm/slub.c index 4907563..f4a6229 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1581,6 +1581,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); void *object; + unsigned int cpuset_mems_cookie; /* * The defrag ratio allows a configuration of the tradeoffs between @@ -1604,23 +1605,32 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags, get_cycles() % 1024 > s->remote_node_defrag_ratio) return NULL; - get_mems_allowed(); - zonelist = node_zonelist(slab_node(current->mempolicy), flags); - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - struct kmem_cache_node *n; - - n = get_node(s, zone_to_nid(zone)); - - if (n && cpuset_zone_allowed_hardwall(zone, flags) && - n->nr_partial > s->min_partial) { - object = get_partial_node(s, n, c); - if (object) { - put_mems_allowed(); - return object; + do { + cpuset_mems_cookie = get_mems_allowed(); + zonelist = node_zonelist(slab_node(current->mempolicy), flags); + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + struct kmem_cache_node *n; + + n = get_node(s, zone_to_nid(zone)); + + if (n && cpuset_zone_allowed_hardwall(zone, flags) && + n->nr_partial > s->min_partial) { + object = get_partial_node(s, n, c); + if (object) { + /* + * Return the object even if + * put_mems_allowed indicated that + * the cpuset mems_allowed was + * updated in parallel. It's a + * harmless race between the alloc + * and the cpuset update. + */ + put_mems_allowed(cpuset_mems_cookie); + return object; + } } } - } - put_mems_allowed(); + } while (!put_mems_allowed(cpuset_mems_cookie)); #endif return NULL; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 440af1d..55d86c9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2343,7 +2343,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, unsigned long writeback_threshold; bool aborted_reclaim; - get_mems_allowed(); delayacct_freepages_start(); if (global_reclaim(sc)) @@ -2407,7 +2406,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, out: delayacct_freepages_end(); - put_mems_allowed(); if (sc->nr_reclaimed) return sc->nr_reclaimed; -- cgit v0.10.2 From ef6942224a185c9e434f6cfe69fe434e732f5b38 Mon Sep 17 00:00:00 2001 From: Bob Liu Date: Wed, 21 Mar 2012 16:34:11 -0700 Subject: ksm: cleanup: introduce find_mergeable_vma() There are multiple places which perform the same check. Add a new find_mergeable_vma() to handle this. Signed-off-by: Bob Liu Acked-by: Hugh Dickins Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/ksm.c b/mm/ksm.c index a6d3fb7..47c8853 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -374,6 +374,20 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) return (ret & VM_FAULT_OOM) ? -ENOMEM : 0; } +static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm, + unsigned long addr) +{ + struct vm_area_struct *vma; + if (ksm_test_exit(mm)) + return NULL; + vma = find_vma(mm, addr); + if (!vma || vma->vm_start > addr) + return NULL; + if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) + return NULL; + return vma; +} + static void break_cow(struct rmap_item *rmap_item) { struct mm_struct *mm = rmap_item->mm; @@ -387,15 +401,9 @@ static void break_cow(struct rmap_item *rmap_item) put_anon_vma(rmap_item->anon_vma); down_read(&mm->mmap_sem); - if (ksm_test_exit(mm)) - goto out; - vma = find_vma(mm, addr); - if (!vma || vma->vm_start > addr) - goto out; - if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) - goto out; - break_ksm(vma, addr); -out: + vma = find_mergeable_vma(mm, addr); + if (vma) + break_ksm(vma, addr); up_read(&mm->mmap_sem); } @@ -421,12 +429,8 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item) struct page *page; down_read(&mm->mmap_sem); - if (ksm_test_exit(mm)) - goto out; - vma = find_vma(mm, addr); - if (!vma || vma->vm_start > addr) - goto out; - if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) + vma = find_mergeable_vma(mm, addr); + if (!vma) goto out; page = follow_page(vma, addr, FOLL_GET); -- cgit v0.10.2 From a1d776ee3147cec2a54a645e92eb2e3e2f65a137 Mon Sep 17 00:00:00 2001 From: David Gibson Date: Wed, 21 Mar 2012 16:34:12 -0700 Subject: hugetlb: cleanup hugetlb.h Make a couple of small cleanups to linux/include/hugetlb.h. The set_file_hugepages() function, which was not used anywhere is removed, and the hugetlbfs_config and hugetlbfs_inode_info structures with its HUGETLBFS_I helper function are moved into inode.c, the only place they were used. These structures are really linked to the hugetlbfs filesystem specifically not to hugepage mm handling in general, so they belong in the filesystem code not in a generally available header. It would be nice to move the hugetlbfs_sb_info (superblock) structure in there as well, but it's currently needed in a number of places via the hstate_vma() and hstate_inode(). Signed-off-by: David Gibson Cc: Hugh Dickins Cc: Paul Mackerras Cc: Andrew Barry Cc: Mel Gorman Cc: Minchan Kim Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 19654cf..4fbd9fc 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -41,6 +41,25 @@ const struct file_operations hugetlbfs_file_operations; static const struct inode_operations hugetlbfs_dir_inode_operations; static const struct inode_operations hugetlbfs_inode_operations; +struct hugetlbfs_config { + uid_t uid; + gid_t gid; + umode_t mode; + long nr_blocks; + long nr_inodes; + struct hstate *hstate; +}; + +struct hugetlbfs_inode_info { + struct shared_policy policy; + struct inode vfs_inode; +}; + +static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode) +{ + return container_of(inode, struct hugetlbfs_inode_info, vfs_inode); +} + static struct backing_dev_info hugetlbfs_backing_dev_info = { .name = "hugetlbfs", .ra_pages = 0, /* No readahead */ diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d9d6c86..7adc492 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -128,15 +128,6 @@ enum { }; #ifdef CONFIG_HUGETLBFS -struct hugetlbfs_config { - uid_t uid; - gid_t gid; - umode_t mode; - long nr_blocks; - long nr_inodes; - struct hstate *hstate; -}; - struct hugetlbfs_sb_info { long max_blocks; /* blocks allowed */ long free_blocks; /* blocks free */ @@ -146,17 +137,6 @@ struct hugetlbfs_sb_info { struct hstate *hstate; }; - -struct hugetlbfs_inode_info { - struct shared_policy policy; - struct inode vfs_inode; -}; - -static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode) -{ - return container_of(inode, struct hugetlbfs_inode_info, vfs_inode); -} - static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb) { return sb->s_fs_info; @@ -179,14 +159,9 @@ static inline int is_file_hugepages(struct file *file) return 0; } -static inline void set_file_hugepages(struct file *file) -{ - file->f_op = &hugetlbfs_file_operations; -} #else /* !CONFIG_HUGETLBFS */ #define is_file_hugepages(file) 0 -#define set_file_hugepages(file) BUG() static inline struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag, struct user_struct **user, int creat_flags) { -- cgit v0.10.2 From 90481622d75715bfcb68501280a917dbfe516029 Mon Sep 17 00:00:00 2001 From: David Gibson Date: Wed, 21 Mar 2012 16:34:12 -0700 Subject: hugepages: fix use after free bug in "quota" handling hugetlbfs_{get,put}_quota() are badly named. They don't interact with the general quota handling code, and they don't much resemble its behaviour. Rather than being about maintaining limits on on-disk block usage by particular users, they are instead about maintaining limits on in-memory page usage (including anonymous MAP_PRIVATE copied-on-write pages) associated with a particular hugetlbfs filesystem instance. Worse, they work by having callbacks to the hugetlbfs filesystem code from the low-level page handling code, in particular from free_huge_page(). This is a layering violation of itself, but more importantly, if the kernel does a get_user_pages() on hugepages (which can happen from KVM amongst others), then the free_huge_page() can be delayed until after the associated inode has already been freed. If an unmount occurs at the wrong time, even the hugetlbfs superblock where the "quota" limits are stored may have been freed. Andrew Barry proposed a patch to fix this by having hugepages, instead of storing a pointer to their address_space and reaching the superblock from there, had the hugepages store pointers directly to the superblock, bumping the reference count as appropriate to avoid it being freed. Andrew Morton rejected that version, however, on the grounds that it made the existing layering violation worse. This is a reworked version of Andrew's patch, which removes the extra, and some of the existing, layering violation. It works by introducing the concept of a hugepage "subpool" at the lower hugepage mm layer - that is a finite logical pool of hugepages to allocate from. hugetlbfs now creates a subpool for each filesystem instance with a page limit set, and a pointer to the subpool gets added to each allocated hugepage, instead of the address_space pointer used now. The subpool has its own lifetime and is only freed once all pages in it _and_ all other references to it (i.e. superblocks) are gone. subpools are optional - a NULL subpool pointer is taken by the code to mean that no subpool limits are in effect. Previous discussion of this bug found in: "Fix refcounting in hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or http://marc.info/?l=linux-mm&m=126928970510627&w=1 v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to alloc_huge_page() - since it already takes the vma, it is not necessary. Signed-off-by: Andrew Barry Signed-off-by: David Gibson Cc: Hugh Dickins Cc: Mel Gorman Cc: Minchan Kim Cc: Hillf Danton Cc: Paul Mackerras Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 4fbd9fc..7913e32 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf) spin_lock(&sbinfo->stat_lock); /* If no limits set, just report 0 for max/free/used * blocks, like simple_statfs() */ - if (sbinfo->max_blocks >= 0) { - buf->f_blocks = sbinfo->max_blocks; - buf->f_bavail = buf->f_bfree = sbinfo->free_blocks; + if (sbinfo->spool) { + long free_pages; + + spin_lock(&sbinfo->spool->lock); + buf->f_blocks = sbinfo->spool->max_hpages; + free_pages = sbinfo->spool->max_hpages + - sbinfo->spool->used_hpages; + buf->f_bavail = buf->f_bfree = free_pages; + spin_unlock(&sbinfo->spool->lock); buf->f_files = sbinfo->max_inodes; buf->f_ffree = sbinfo->free_inodes; } @@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb) if (sbi) { sb->s_fs_info = NULL; + + if (sbi->spool) + hugepage_put_subpool(sbi->spool); + kfree(sbi); } } @@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sb->s_fs_info = sbinfo; sbinfo->hstate = config.hstate; spin_lock_init(&sbinfo->stat_lock); - sbinfo->max_blocks = config.nr_blocks; - sbinfo->free_blocks = config.nr_blocks; sbinfo->max_inodes = config.nr_inodes; sbinfo->free_inodes = config.nr_inodes; + sbinfo->spool = NULL; + if (config.nr_blocks != -1) { + sbinfo->spool = hugepage_new_subpool(config.nr_blocks); + if (!sbinfo->spool) + goto out_free; + } sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = huge_page_size(config.hstate); sb->s_blocksize_bits = huge_page_shift(config.hstate); @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sb->s_root = root; return 0; out_free: + if (sbinfo->spool) + kfree(sbinfo->spool); kfree(sbinfo); return -ENOMEM; } -int hugetlb_get_quota(struct address_space *mapping, long delta) -{ - int ret = 0; - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb); - - if (sbinfo->free_blocks > -1) { - spin_lock(&sbinfo->stat_lock); - if (sbinfo->free_blocks - delta >= 0) - sbinfo->free_blocks -= delta; - else - ret = -ENOMEM; - spin_unlock(&sbinfo->stat_lock); - } - - return ret; -} - -void hugetlb_put_quota(struct address_space *mapping, long delta) -{ - struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb); - - if (sbinfo->free_blocks > -1) { - spin_lock(&sbinfo->stat_lock); - sbinfo->free_blocks += delta; - spin_unlock(&sbinfo->stat_lock); - } -} - static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7adc492..cf01817 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -14,6 +14,15 @@ struct user_struct; #include #include +struct hugepage_subpool { + spinlock_t lock; + long count; + long max_hpages, used_hpages; +}; + +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); +void hugepage_put_subpool(struct hugepage_subpool *spool); + int PageHuge(struct page *page); void reset_vma_resv_huge_pages(struct vm_area_struct *vma); @@ -129,12 +138,11 @@ enum { #ifdef CONFIG_HUGETLBFS struct hugetlbfs_sb_info { - long max_blocks; /* blocks allowed */ - long free_blocks; /* blocks free */ long max_inodes; /* inodes allowed */ long free_inodes; /* inodes free */ spinlock_t stat_lock; struct hstate *hstate; + struct hugepage_subpool *spool; }; static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb) @@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations; extern const struct vm_operations_struct hugetlb_vm_ops; struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct, struct user_struct **user, int creat_flags); -int hugetlb_get_quota(struct address_space *mapping, long delta); -void hugetlb_put_quota(struct address_space *mapping, long delta); static inline int is_file_hugepages(struct file *file) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b1c3148..afa057a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size; */ static DEFINE_SPINLOCK(hugetlb_lock); +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) +{ + bool free = (spool->count == 0) && (spool->used_hpages == 0); + + spin_unlock(&spool->lock); + + /* If no pages are used, and no other handles to the subpool + * remain, free the subpool the subpool remain */ + if (free) + kfree(spool); +} + +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) +{ + struct hugepage_subpool *spool; + + spool = kmalloc(sizeof(*spool), GFP_KERNEL); + if (!spool) + return NULL; + + spin_lock_init(&spool->lock); + spool->count = 1; + spool->max_hpages = nr_blocks; + spool->used_hpages = 0; + + return spool; +} + +void hugepage_put_subpool(struct hugepage_subpool *spool) +{ + spin_lock(&spool->lock); + BUG_ON(!spool->count); + spool->count--; + unlock_or_release_subpool(spool); +} + +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool, + long delta) +{ + int ret = 0; + + if (!spool) + return 0; + + spin_lock(&spool->lock); + if ((spool->used_hpages + delta) <= spool->max_hpages) { + spool->used_hpages += delta; + } else { + ret = -ENOMEM; + } + spin_unlock(&spool->lock); + + return ret; +} + +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool, + long delta) +{ + if (!spool) + return; + + spin_lock(&spool->lock); + spool->used_hpages -= delta; + /* If hugetlbfs_put_super couldn't free spool due to + * an outstanding quota reference, free it now. */ + unlock_or_release_subpool(spool); +} + +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) +{ + return HUGETLBFS_SB(inode->i_sb)->spool; +} + +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) +{ + return subpool_inode(vma->vm_file->f_dentry->d_inode); +} + /* * Region tracking -- allows tracking of reservations and instantiated pages * across the pages in a mapping. @@ -540,9 +618,9 @@ static void free_huge_page(struct page *page) */ struct hstate *h = page_hstate(page); int nid = page_to_nid(page); - struct address_space *mapping; + struct hugepage_subpool *spool = + (struct hugepage_subpool *)page_private(page); - mapping = (struct address_space *) page_private(page); set_page_private(page, 0); page->mapping = NULL; BUG_ON(page_count(page)); @@ -558,8 +636,7 @@ static void free_huge_page(struct page *page) enqueue_huge_page(h, page); } spin_unlock(&hugetlb_lock); - if (mapping) - hugetlb_put_quota(mapping, 1); + hugepage_subpool_put_pages(spool, 1); } static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) @@ -977,11 +1054,12 @@ static void return_unused_surplus_pages(struct hstate *h, /* * Determine if the huge page at addr within the vma has an associated * reservation. Where it does not we will need to logically increase - * reservation and actually increase quota before an allocation can occur. - * Where any new reservation would be required the reservation change is - * prepared, but not committed. Once the page has been quota'd allocated - * an instantiated the change should be committed via vma_commit_reservation. - * No action is required on failure. + * reservation and actually increase subpool usage before an allocation + * can occur. Where any new reservation would be required the + * reservation change is prepared, but not committed. Once the page + * has been allocated from the subpool and instantiated the change should + * be committed via vma_commit_reservation. No action is required on + * failure. */ static long vma_needs_reservation(struct hstate *h, struct vm_area_struct *vma, unsigned long addr) @@ -1030,24 +1108,24 @@ static void vma_commit_reservation(struct hstate *h, static struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { + struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct page *page; - struct address_space *mapping = vma->vm_file->f_mapping; - struct inode *inode = mapping->host; long chg; /* - * Processes that did not create the mapping will have no reserves and - * will not have accounted against quota. Check that the quota can be - * made before satisfying the allocation - * MAP_NORESERVE mappings may also need pages and quota allocated - * if no reserve mapping overlaps. + * Processes that did not create the mapping will have no + * reserves and will not have accounted against subpool + * limit. Check that the subpool limit can be made before + * satisfying the allocation MAP_NORESERVE mappings may also + * need pages and subpool limit allocated allocated if no reserve + * mapping overlaps. */ chg = vma_needs_reservation(h, vma, addr); if (chg < 0) return ERR_PTR(-VM_FAULT_OOM); if (chg) - if (hugetlb_get_quota(inode->i_mapping, chg)) + if (hugepage_subpool_get_pages(spool, chg)) return ERR_PTR(-VM_FAULT_SIGBUS); spin_lock(&hugetlb_lock); @@ -1057,12 +1135,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, if (!page) { page = alloc_buddy_huge_page(h, NUMA_NO_NODE); if (!page) { - hugetlb_put_quota(inode->i_mapping, chg); + hugepage_subpool_put_pages(spool, chg); return ERR_PTR(-VM_FAULT_SIGBUS); } } - set_page_private(page, (unsigned long) mapping); + set_page_private(page, (unsigned long)spool); vma_commit_reservation(h, vma, addr); @@ -2083,6 +2161,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) { struct hstate *h = hstate_vma(vma); struct resv_map *reservations = vma_resv_map(vma); + struct hugepage_subpool *spool = subpool_vma(vma); unsigned long reserve; unsigned long start; unsigned long end; @@ -2098,7 +2177,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) if (reserve) { hugetlb_acct_memory(h, -reserve); - hugetlb_put_quota(vma->vm_file->f_mapping, reserve); + hugepage_subpool_put_pages(spool, reserve); } } } @@ -2331,7 +2410,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, */ address = address & huge_page_mask(h); pgoff = vma_hugecache_offset(h, vma, address); - mapping = (struct address_space *)page_private(page); + mapping = vma->vm_file->f_dentry->d_inode->i_mapping; /* * Take the mapping lock for the duration of the table walk. As @@ -2884,11 +2963,12 @@ int hugetlb_reserve_pages(struct inode *inode, { long ret, chg; struct hstate *h = hstate_inode(inode); + struct hugepage_subpool *spool = subpool_inode(inode); /* * Only apply hugepage reservation if asked. At fault time, an * attempt will be made for VM_NORESERVE to allocate a page - * and filesystem quota without using reserves + * without using reserves */ if (vm_flags & VM_NORESERVE) return 0; @@ -2915,17 +2995,17 @@ int hugetlb_reserve_pages(struct inode *inode, if (chg < 0) return chg; - /* There must be enough filesystem quota for the mapping */ - if (hugetlb_get_quota(inode->i_mapping, chg)) + /* There must be enough pages in the subpool for the mapping */ + if (hugepage_subpool_get_pages(spool, chg)) return -ENOSPC; /* * Check enough hugepages are available for the reservation. - * Hand back the quota if there are not + * Hand the pages back to the subpool if there are not */ ret = hugetlb_acct_memory(h, chg); if (ret < 0) { - hugetlb_put_quota(inode->i_mapping, chg); + hugepage_subpool_put_pages(spool, chg); return ret; } @@ -2949,12 +3029,13 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) { struct hstate *h = hstate_inode(inode); long chg = region_truncate(&inode->i_mapping->private_list, offset); + struct hugepage_subpool *spool = subpool_inode(inode); spin_lock(&inode->i_lock); inode->i_blocks -= (blocks_per_huge_page(h) * freed); spin_unlock(&inode->i_lock); - hugetlb_put_quota(inode->i_mapping, (chg - freed)); + hugepage_subpool_put_pages(spool, (chg - freed)); hugetlb_acct_memory(h, -(chg - freed)); } -- cgit v0.10.2 From 05af2e104a0c282dcd9303431e1360750ba76de6 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:34:13 -0700 Subject: mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() sync_mm_rss() can only be used for current to avoid race conditions in iterating and clearing its per-task counters. Remove the task argument for it and its helper function, __sync_task_rss_stat(), to avoid thinking it can be used safely for anything other than current. Signed-off-by: David Rientjes Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/exec.c b/fs/exec.c index 3908544..6ed164d 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -824,7 +824,7 @@ static int exec_mmap(struct mm_struct *mm) /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; - sync_mm_rss(tsk, old_mm); + sync_mm_rss(old_mm); mm_release(tsk, old_mm); if (old_mm) { diff --git a/include/linux/mm.h b/include/linux/mm.h index df17ff2..ce2b2a3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1131,9 +1131,9 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss, } #if defined(SPLIT_RSS_COUNTING) -void sync_mm_rss(struct task_struct *task, struct mm_struct *mm); +void sync_mm_rss(struct mm_struct *mm); #else -static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) +static inline void sync_mm_rss(struct mm_struct *mm) { } #endif diff --git a/kernel/exit.c b/kernel/exit.c index 0ed15fe..d26acd3 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -934,7 +934,7 @@ void do_exit(long code) acct_update_integrals(tsk); /* sync mm's RSS info before statistics gathering */ if (tsk->mm) - sync_mm_rss(tsk, tsk->mm); + sync_mm_rss(tsk->mm); group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { hrtimer_cancel(&tsk->signal->real_timer); diff --git a/mm/memory.c b/mm/memory.c index a5de734..2d27239 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -125,17 +125,17 @@ core_initcall(init_zero_pfn); #if defined(SPLIT_RSS_COUNTING) -static void __sync_task_rss_stat(struct task_struct *task, struct mm_struct *mm) +static void __sync_task_rss_stat(struct mm_struct *mm) { int i; for (i = 0; i < NR_MM_COUNTERS; i++) { - if (task->rss_stat.count[i]) { - add_mm_counter(mm, i, task->rss_stat.count[i]); - task->rss_stat.count[i] = 0; + if (current->rss_stat.count[i]) { + add_mm_counter(mm, i, current->rss_stat.count[i]); + current->rss_stat.count[i] = 0; } } - task->rss_stat.events = 0; + current->rss_stat.events = 0; } static void add_mm_counter_fast(struct mm_struct *mm, int member, int val) @@ -157,12 +157,12 @@ static void check_sync_rss_stat(struct task_struct *task) if (unlikely(task != current)) return; if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH)) - __sync_task_rss_stat(task, task->mm); + __sync_task_rss_stat(task->mm); } -void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) +void sync_mm_rss(struct mm_struct *mm) { - __sync_task_rss_stat(task, mm); + __sync_task_rss_stat(mm); } #else /* SPLIT_RSS_COUNTING */ @@ -643,7 +643,7 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) int i; if (current->mm == mm) - sync_mm_rss(current, mm); + sync_mm_rss(mm); for (i = 0; i < NR_MM_COUNTERS; i++) if (rss[i]) add_mm_counter(mm, i, rss[i]); diff --git a/mm/mmu_context.c b/mm/mmu_context.c index cf332bc..3dcfaf4 100644 --- a/mm/mmu_context.c +++ b/mm/mmu_context.c @@ -53,7 +53,7 @@ void unuse_mm(struct mm_struct *mm) struct task_struct *tsk = current; task_lock(tsk); - sync_mm_rss(tsk, mm); + sync_mm_rss(mm); tsk->mm = NULL; /* active_mm is still 'mm' */ enter_lazy_tlb(mm, tsk); -- cgit v0.10.2 From ea48cf7863c789579b170ef28e7fc62728365d6e Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:34:13 -0700 Subject: mm, counters: fold __sync_task_rss_stat() into sync_mm_rss() There's no difference between sync_mm_rss() and __sync_task_rss_stat(), so fold the latter into the former. Signed-off-by: David Rientjes Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memory.c b/mm/memory.c index 2d27239..1e0561e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -125,7 +125,7 @@ core_initcall(init_zero_pfn); #if defined(SPLIT_RSS_COUNTING) -static void __sync_task_rss_stat(struct mm_struct *mm) +void sync_mm_rss(struct mm_struct *mm) { int i; @@ -157,12 +157,7 @@ static void check_sync_rss_stat(struct task_struct *task) if (unlikely(task != current)) return; if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH)) - __sync_task_rss_stat(task->mm); -} - -void sync_mm_rss(struct mm_struct *mm) -{ - __sync_task_rss_stat(mm); + sync_mm_rss(task->mm); } #else /* SPLIT_RSS_COUNTING */ -- cgit v0.10.2 From 21a3c273f88c9cbbaf7e14505df0131d95c8f262 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 21 Mar 2012 16:34:13 -0700 Subject: mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warning Add the thread name and pid of the application that is allocating shm segments with MAP_HUGETLB without being a part of /proc/sys/vm/hugetlb_shm_group or having CAP_IPC_LOCK. This identifies the application so it may be fixed by avoiding using the deprecated exception (see Documentation/feature-removal-schedule.txt). Signed-off-by: David Rientjes Cc: Dave Jones Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 7913e32..7940815 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -953,7 +953,11 @@ struct file *hugetlb_file_setup(const char *name, size_t size, if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) { *user = current_user(); if (user_shm_lock(size, *user)) { - printk_once(KERN_WARNING "Using mlock ulimits for SHM_HUGETLB is deprecated\n"); + task_lock(current); + printk_once(KERN_WARNING + "%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n", + current->comm, current->pid); + task_unlock(current); } else { *user = NULL; return ERR_PTR(-EPERM); -- cgit v0.10.2 From 40716e29243de46720e5773797791466c28904ec Mon Sep 17 00:00:00 2001 From: Steven Truelove Date: Wed, 21 Mar 2012 16:34:14 -0700 Subject: hugetlbfs: fix alignment of huge page requests When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 7940815..631329f 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -935,8 +935,8 @@ static int can_do_hugetlb_shm(void) return capable(CAP_IPC_LOCK) || in_group_p(sysctl_hugetlb_shm_group); } -struct file *hugetlb_file_setup(const char *name, size_t size, - vm_flags_t acctflag, +struct file *hugetlb_file_setup(const char *name, unsigned long addr, + size_t size, vm_flags_t acctflag, struct user_struct **user, int creat_flags) { int error = -ENOMEM; @@ -945,6 +945,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size, struct path path; struct dentry *root; struct qstr quick_string; + struct hstate *hstate; + unsigned long num_pages; *user = NULL; if (!hugetlbfs_vfsmount) @@ -978,10 +980,12 @@ struct file *hugetlb_file_setup(const char *name, size_t size, if (!inode) goto out_dentry; + hstate = hstate_inode(inode); + size += addr & ~huge_page_mask(hstate); + num_pages = ALIGN(size, huge_page_size(hstate)) >> + huge_page_shift(hstate); error = -ENOMEM; - if (hugetlb_reserve_pages(inode, 0, - size >> huge_page_shift(hstate_inode(inode)), NULL, - acctflag)) + if (hugetlb_reserve_pages(inode, 0, num_pages, NULL, acctflag)) goto out_inode; d_instantiate(path.dentry, inode); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index cf01817..000837e 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -152,7 +152,8 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb) extern const struct file_operations hugetlbfs_file_operations; extern const struct vm_operations_struct hugetlb_vm_ops; -struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct, +struct file *hugetlb_file_setup(const char *name, unsigned long addr, + size_t size, vm_flags_t acct, struct user_struct **user, int creat_flags); static inline int is_file_hugepages(struct file *file) @@ -168,7 +169,8 @@ static inline int is_file_hugepages(struct file *file) #else /* !CONFIG_HUGETLBFS */ #define is_file_hugepages(file) 0 -static inline struct file *hugetlb_file_setup(const char *name, size_t size, +static inline struct file * +hugetlb_file_setup(const char *name, unsigned long addr, size_t size, vm_flags_t acctflag, struct user_struct **user, int creat_flags) { return ERR_PTR(-ENOSYS); diff --git a/ipc/shm.c b/ipc/shm.c index b76be5b..406c5b2 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -482,7 +482,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) /* hugetlb_file_setup applies strict accounting */ if (shmflg & SHM_NORESERVE) acctflag = VM_NORESERVE; - file = hugetlb_file_setup(name, size, acctflag, + file = hugetlb_file_setup(name, 0, size, acctflag, &shp->mlock_user, HUGETLB_SHMFS_INODE); } else { /* diff --git a/mm/mmap.c b/mm/mmap.c index 9e0c0de..a19cc27 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1099,9 +1099,9 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len, * A dummy user value is used because we are not locking * memory so no accounting is necessary */ - len = ALIGN(len, huge_page_size(&default_hstate)); - file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE, - &user, HUGETLB_ANONHUGE_INODE); + file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len, + VM_NORESERVE, &user, + HUGETLB_ANONHUGE_INODE); if (IS_ERR(file)) return PTR_ERR(file); } -- cgit v0.10.2 From b69add218d32450d6604bc9080f6e33e19b06f5e Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 21 Mar 2012 16:34:14 -0700 Subject: hugetlb: remove prev_vma from hugetlb_get_unmapped_area_topdown() After looking up the vma which covers or follows the cached search address, the following condition is always true: !prev_vma || (addr >= prev_vma->vm_end) so we can stop checking the previous VMA altogether. Signed-off-by: Xiao Guangrong Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index c20e81c..f6679a7 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -308,7 +308,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file, { struct hstate *h = hstate_file(file); struct mm_struct *mm = current->mm; - struct vm_area_struct *vma, *prev_vma; + struct vm_area_struct *vma; unsigned long base = mm->mmap_base; unsigned long addr = addr0; unsigned long largest_hole = mm->cached_hole_size; @@ -340,22 +340,14 @@ try_again: if (!vma) return addr; - /* - * new region fits between prev_vma->vm_end and - * vma->vm_start, use it: - */ - prev_vma = vma->vm_prev; - if (addr + len <= vma->vm_start && - (!prev_vma || (addr >= prev_vma->vm_end))) { + if (addr + len <= vma->vm_start) { /* remember the address as a hint for next time */ mm->cached_hole_size = largest_hole; return (mm->free_area_cache = addr); - } else { + } else if (mm->free_area_cache == vma->vm_end) { /* pull free_area_cache down to the first hole */ - if (mm->free_area_cache == vma->vm_end) { - mm->free_area_cache = vma->vm_start; - mm->cached_hole_size = largest_hole; - } + mm->free_area_cache = vma->vm_start; + mm->cached_hole_size = largest_hole; } /* remember the largest hole we saw so far */ -- cgit v0.10.2 From d1d5e05ffdc110021ae7937802e88ae0d223dcdc Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Wed, 21 Mar 2012 16:34:15 -0700 Subject: hugetlbfs: return error code when initializing module Return an errno upon failure to create inode kmem cache, and unregister the FS upon failure to mount. [akpm@linux-foundation.org: remove unneeded test of `error'] Signed-off-by: Hillf Danton Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 631329f..2691633 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1021,6 +1021,7 @@ static int __init init_hugetlbfs_fs(void) if (error) return error; + error = -ENOMEM; hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache", sizeof(struct hugetlbfs_inode_info), 0, 0, init_once); @@ -1039,10 +1040,10 @@ static int __init init_hugetlbfs_fs(void) } error = PTR_ERR(vfsmount); + unregister_filesystem(&hugetlbfs_fs_type); out: - if (error) - kmem_cache_destroy(hugetlbfs_inode_cachep); + kmem_cache_destroy(hugetlbfs_inode_cachep); out2: bdi_destroy(&hugetlbfs_backing_dev_info); return error; -- cgit v0.10.2 From 8d13bddd11c10db40e2c81b4b224c11126691fc0 Mon Sep 17 00:00:00 2001 From: Kautuk Consul Date: Wed, 21 Mar 2012 16:34:15 -0700 Subject: page_alloc.c: remove add_from_early_node_map() add_from_early_node_map() is unused. Signed-off-by: Kautuk Consul Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index ce2b2a3..b1c8318 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1295,8 +1295,6 @@ extern void get_pfn_range_for_nid(unsigned int nid, extern unsigned long find_min_pfn_with_active_regions(void); extern void free_bootmem_with_active_regions(int nid, unsigned long max_low_pfn); -int add_from_early_node_map(struct range *range, int az, - int nr_range, int nid); extern void sparse_memory_present_with_active_regions(int nid); #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 40de685..70bbd0f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3940,18 +3940,6 @@ void __init free_bootmem_with_active_regions(int nid, unsigned long max_low_pfn) } } -int __init add_from_early_node_map(struct range *range, int az, - int nr_range, int nid) -{ - unsigned long start_pfn, end_pfn; - int i; - - /* need to go over early_node_map to find out good range for node */ - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) - nr_range = add_range(range, az, nr_range, start_pfn, end_pfn); - return nr_range; -} - /** * sparse_memory_present_with_active_regions - Call memory_present for each active range * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used. -- cgit v0.10.2 From b224ef856b1a5b949daff5937a9e187fe622b8f5 Mon Sep 17 00:00:00 2001 From: Kautuk Consul Date: Wed, 21 Mar 2012 16:34:15 -0700 Subject: page_alloc: remove unused find_zone_movable_pfns_for_nodes() argument find_zone_movable_pfns_for_nodes() does not use its argument. Signed-off-by: Kautuk Consul Cc: David Rientjes Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 70bbd0f..caea788 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4524,7 +4524,7 @@ static unsigned long __init early_calculate_totalpages(void) * memory. When they don't, some nodes will have more kernelcore than * others */ -static void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn) +static void __init find_zone_movable_pfns_for_nodes(void) { int i, nid; unsigned long usable_startpfn; @@ -4716,7 +4716,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) /* Find the PFNs that ZONE_MOVABLE begins at in each node */ memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); - find_zone_movable_pfns_for_nodes(zone_movable_pfn); + find_zone_movable_pfns_for_nodes(); /* Print out the zone ranges */ printk("Zone PFN ranges:\n"); -- cgit v0.10.2 From d71b5a73fe9af42752c4329b087f7911b35f8f79 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Wed, 21 Mar 2012 16:34:16 -0700 Subject: numa_emulation: fix cpumask_of_node() Without this fix the cpumask_of_node() for a fake=numa=2 is: cpumask 0 ff cpumask 1 ff with the fix it's correct and it's set to: cpumask 0 55 cpumask 1 aa Signed-off-by: Andrea Arcangeli Cc: Andi Kleen Cc: Johannes Weiner Cc: David Rientjes Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 46db568..740b0a3 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -60,7 +60,7 @@ static int __init emu_setup_memblk(struct numa_meminfo *ei, eb->nid = nid; if (emu_nid_to_phys[nid] == NUMA_NO_NODE) - emu_nid_to_phys[nid] = pb->nid; + emu_nid_to_phys[nid] = nid; pb->start += size; if (pb->start >= pb->end) { -- cgit v0.10.2 From 88f6b4c32e531dc5b06bd05144f790847a1fdaeb Mon Sep 17 00:00:00 2001 From: Kautuk Consul Date: Wed, 21 Mar 2012 16:34:16 -0700 Subject: mmap.c: fix comment for __insert_vm_struct() The comment above __insert_vm_struct seems to suggest that this function is also going to link the VMA with the anon_vma, but this is not true. This function only links the VMA to the mm->mm_rb tree and the mm->mmap linked list. [akpm@linux-foundation.org: improve comment layout and text] Signed-off-by: Kautuk Consul Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/mmap.c b/mm/mmap.c index a19cc27..230f0ba 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -451,9 +451,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Helper for vma_adjust in the split_vma insert case: - * insert vm structure into list and rbtree and anon_vma, - * but it has already been inserted into prio_tree earlier. + * Helper for vma_adjust() in the split_vma insert case: insert a vma into the + * mm's list and rbtree. It has already been inserted into the prio_tree. */ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) { -- cgit v0.10.2 From 1480de0340a8d5f094b74d7c4b902456c9a06903 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:34:17 -0700 Subject: mm: forbid lumpy-reclaim in shrink_active_list() Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC. (sync/async sign is used only in shrink_page_list and does not affect shrink_active_list) Currenly shrink_active_list() sometimes works in lumpy-reclaim mode, if RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier shrink_inactive_list(). Meanwhile, in age_active_anon() sc->reclaim_mode is totally zero. So the current behavior is too complex and confusing, and this looks like bug. In general, shrink_active_list() populates the inactive list for the next shrink_inactive_list(). Lumpy shring_inactive_list() isolates pages around the chosen one from both the active and inactive lists. So, there is no reason for lumpy isolation in shrink_active_list(). See also: https://lkml.org/lkml/2012/3/15/583 Signed-off-by: Konstantin Khlebnikov Proposed-by: Hugh Dickins Acked-by: Johannes Weiner Cc: Rik van Riel Cc: Minchan Kim Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 55d86c9..49f15ef 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1690,6 +1690,8 @@ static void shrink_active_list(unsigned long nr_to_scan, lru_add_drain(); + reset_reclaim_mode(sc); + if (!sc->may_unmap) isolate_mode |= ISOLATE_UNMAPPED; if (!sc->may_writepage) -- cgit v0.10.2 From 052b1987faca3606109d88d96bce124851f7c4c2 Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Wed, 21 Mar 2012 16:34:17 -0700 Subject: swap: don't do discard if no discard option added When swapon() was not passed the SWAP_FLAG_DISCARD option, sys_swapon() will still perform a discard operation. This can cause problems if discard is slow or buggy. Reverse the order of the check so that a discard operation is performed only if the sys_swapon() caller is attempting to enable discard. Signed-off-by: Shaohua Li Reported-by: Holger Kiehl Tested-by: Holger Kiehl Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/swapfile.c b/mm/swapfile.c index b82c028..21b5694 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2103,7 +2103,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) p->flags |= SWP_SOLIDSTATE; p->cluster_next = 1 + (random32() % p->highest_bit); } - if (discard_swap(p) == 0 && (swap_flags & SWAP_FLAG_DISCARD)) + if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0) p->flags |= SWP_DISCARDABLE; } -- cgit v0.10.2 From 31a79235fc75b506e282e43723107a40f3bc5c07 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:18 -0700 Subject: memcg: replace MEM_CONT by MEM_RES_CTLR Correct an #endif comment in memcontrol.h from MEM_CONT to MEM_RES_CTLR. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Acked-by: Kirill A. Shutemov Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d909650..e76f107 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -392,7 +392,7 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage, struct page *newpage) { } -#endif /* CONFIG_CGROUP_MEM_CONT */ +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */ #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM) static inline bool -- cgit v0.10.2 From d79154bb5223edad407db61f59b9b15b0080ed80 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:18 -0700 Subject: memcg: replace mem and mem_cont stragglers Replace mem and mem_cont stragglers in memcontrol.c by memcg. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Acked-by: Kirill A. Shutemov Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bb04067..e5370db 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -144,7 +144,7 @@ struct mem_cgroup_per_zone { unsigned long long usage_in_excess;/* Set to the value by which */ /* the soft limit is exceeded*/ bool on_tree; - struct mem_cgroup *mem; /* Back pointer, we cannot */ + struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; /* Macro for accessing counter */ @@ -612,9 +612,9 @@ retry: * we will to add it back at the end of reclaim to its correct * position in the tree. */ - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - if (!res_counter_soft_limit_excess(&mz->mem->res) || - !css_tryget(&mz->mem->css)) + __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); + if (!res_counter_soft_limit_excess(&mz->memcg->res) || + !css_tryget(&mz->memcg->css)) goto retry; done: return mz; @@ -1772,22 +1772,22 @@ static DEFINE_SPINLOCK(memcg_oom_lock); static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq); struct oom_wait_info { - struct mem_cgroup *mem; + struct mem_cgroup *memcg; wait_queue_t wait; }; static int memcg_oom_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *arg) { - struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg, - *oom_wait_memcg; + struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg; + struct mem_cgroup *oom_wait_memcg; struct oom_wait_info *oom_wait_info; oom_wait_info = container_of(wait, struct oom_wait_info, wait); - oom_wait_memcg = oom_wait_info->mem; + oom_wait_memcg = oom_wait_info->memcg; /* - * Both of oom_wait_info->mem and wake_mem are stable under us. + * Both of oom_wait_info->memcg and wake_memcg are stable under us. * Then we can use css_is_ancestor without taking care of RCU. */ if (!mem_cgroup_same_or_subtree(oom_wait_memcg, wake_memcg) @@ -1816,7 +1816,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, int order) struct oom_wait_info owait; bool locked, need_to_kill; - owait.mem = memcg; + owait.memcg = memcg; owait.wait.flags = 0; owait.wait.func = memcg_oom_wake_function; owait.wait.private = current; @@ -3549,7 +3549,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, break; nr_scanned = 0; - reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, + reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone, gfp_mask, &nr_scanned); nr_reclaimed += reclaimed; *total_scanned += nr_scanned; @@ -3576,13 +3576,13 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, next_mz = __mem_cgroup_largest_soft_limit_node(mctz); if (next_mz == mz) - css_put(&next_mz->mem->css); + css_put(&next_mz->memcg->css); else /* next_mz == NULL or other memcg */ break; } while (1); } - __mem_cgroup_remove_exceeded(mz->mem, mz, mctz); - excess = res_counter_soft_limit_excess(&mz->mem->res); + __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz); + excess = res_counter_soft_limit_excess(&mz->memcg->res); /* * One school of thought says that we should not add * back the node to the tree if reclaim returns 0. @@ -3592,9 +3592,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, * term TODO. */ /* If excess == 0, no tree ops */ - __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess); + __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess); spin_unlock(&mctz->lock); - css_put(&mz->mem->css); + css_put(&mz->memcg->css); loop++; /* * Could not reclaim anything and there are no more @@ -3607,7 +3607,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, break; } while (!nr_reclaimed); if (next_mz) - css_put(&next_mz->mem->css); + css_put(&next_mz->memcg->css); return nr_reclaimed; } @@ -4098,38 +4098,38 @@ static int mem_control_numa_stat_show(struct seq_file *m, void *arg) unsigned long total_nr, file_nr, anon_nr, unevictable_nr; unsigned long node_nr; struct cgroup *cont = m->private; - struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont); + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - total_nr = mem_cgroup_nr_lru_pages(mem_cont, LRU_ALL); + total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL); seq_printf(m, "total=%lu", total_nr); for_each_node_state(nid, N_HIGH_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(mem_cont, nid, LRU_ALL); + node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL); seq_printf(m, " N%d=%lu", nid, node_nr); } seq_putc(m, '\n'); - file_nr = mem_cgroup_nr_lru_pages(mem_cont, LRU_ALL_FILE); + file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE); seq_printf(m, "file=%lu", file_nr); for_each_node_state(nid, N_HIGH_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(mem_cont, nid, + node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE); seq_printf(m, " N%d=%lu", nid, node_nr); } seq_putc(m, '\n'); - anon_nr = mem_cgroup_nr_lru_pages(mem_cont, LRU_ALL_ANON); + anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON); seq_printf(m, "anon=%lu", anon_nr); for_each_node_state(nid, N_HIGH_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(mem_cont, nid, + node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON); seq_printf(m, " N%d=%lu", nid, node_nr); } seq_putc(m, '\n'); - unevictable_nr = mem_cgroup_nr_lru_pages(mem_cont, BIT(LRU_UNEVICTABLE)); + unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE)); seq_printf(m, "unevictable=%lu", unevictable_nr); for_each_node_state(nid, N_HIGH_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(mem_cont, nid, + node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(LRU_UNEVICTABLE)); seq_printf(m, " N%d=%lu", nid, node_nr); } @@ -4141,12 +4141,12 @@ static int mem_control_numa_stat_show(struct seq_file *m, void *arg) static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, struct cgroup_map_cb *cb) { - struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont); + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); struct mcs_total_stat mystat; int i; memset(&mystat, 0, sizeof(mystat)); - mem_cgroup_get_local_stat(mem_cont, &mystat); + mem_cgroup_get_local_stat(memcg, &mystat); for (i = 0; i < NR_MCS_STAT; i++) { @@ -4158,14 +4158,14 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, /* Hierarchical information */ { unsigned long long limit, memsw_limit; - memcg_get_hierarchical_limit(mem_cont, &limit, &memsw_limit); + memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit); cb->fill(cb, "hierarchical_memory_limit", limit); if (do_swap_account) cb->fill(cb, "hierarchical_memsw_limit", memsw_limit); } memset(&mystat, 0, sizeof(mystat)); - mem_cgroup_get_total_stat(mem_cont, &mystat); + mem_cgroup_get_total_stat(memcg, &mystat); for (i = 0; i < NR_MCS_STAT; i++) { if (i == MCS_SWAP && !do_swap_account) continue; @@ -4181,7 +4181,7 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, for_each_online_node(nid) for (zid = 0; zid < MAX_NR_ZONES; zid++) { - mz = mem_cgroup_zoneinfo(mem_cont, nid, zid); + mz = mem_cgroup_zoneinfo(memcg, nid, zid); recent_rotated[0] += mz->reclaim_stat.recent_rotated[0]; @@ -4758,7 +4758,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) INIT_LIST_HEAD(&mz->lruvec.lists[l]); mz->usage_in_excess = 0; mz->on_tree = false; - mz->mem = memcg; + mz->memcg = memcg; } memcg->info.nodeinfo[node] = pn; return 0; @@ -4771,29 +4771,29 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) static struct mem_cgroup *mem_cgroup_alloc(void) { - struct mem_cgroup *mem; + struct mem_cgroup *memcg; int size = sizeof(struct mem_cgroup); /* Can be very big if MAX_NUMNODES is very big */ if (size < PAGE_SIZE) - mem = kzalloc(size, GFP_KERNEL); + memcg = kzalloc(size, GFP_KERNEL); else - mem = vzalloc(size); + memcg = vzalloc(size); - if (!mem) + if (!memcg) return NULL; - mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu); - if (!mem->stat) + memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu); + if (!memcg->stat) goto out_free; - spin_lock_init(&mem->pcp_counter_lock); - return mem; + spin_lock_init(&memcg->pcp_counter_lock); + return memcg; out_free: if (size < PAGE_SIZE) - kfree(mem); + kfree(memcg); else - vfree(mem); + vfree(memcg); return NULL; } -- cgit v0.10.2 From 1eb4927251a4e5ab152e64afb29453547365fde8 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:19 -0700 Subject: memcg: lru_size instead of MEM_CGROUP_ZSTAT I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to obscure the LRU counts. For easier searching? So call it lru_size rather than bare count (lru_length sounds better, but would be wrong, since each huge page raises lru_size hugely). Signed-off-by: Hugh Dickins Acked-by: Kirill A. Shutemov Acked-by: KAMEZAWA Hiroyuki Cc: Michal Hocko Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e5370db..6405e78 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -135,7 +135,7 @@ struct mem_cgroup_reclaim_iter { */ struct mem_cgroup_per_zone { struct lruvec lruvec; - unsigned long count[NR_LRU_LISTS]; + unsigned long lru_size[NR_LRU_LISTS]; struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; @@ -147,8 +147,6 @@ struct mem_cgroup_per_zone { struct mem_cgroup *memcg; /* Back pointer, we cannot */ /* use container_of */ }; -/* Macro for accessing counter */ -#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) struct mem_cgroup_per_node { struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; @@ -728,7 +726,7 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid, for_each_lru(l) { if (BIT(l) & lru_mask) - ret += MEM_CGROUP_ZSTAT(mz, l); + ret += mz->lru_size[l]; } return ret; } @@ -1077,7 +1075,7 @@ struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page, mz = page_cgroup_zoneinfo(memcg, page); /* compound_order() is stabilized through lru_lock */ - MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page); + mz->lru_size[lru] += 1 << compound_order(page); return &mz->lruvec; } @@ -1105,8 +1103,8 @@ void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru) VM_BUG_ON(!memcg); mz = page_cgroup_zoneinfo(memcg, page); /* huge page split is done under lru_lock. so, we have no races. */ - VM_BUG_ON(MEM_CGROUP_ZSTAT(mz, lru) < (1 << compound_order(page))); - MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page); + VM_BUG_ON(mz->lru_size[lru] < (1 << compound_order(page))); + mz->lru_size[lru] -= 1 << compound_order(page); } void mem_cgroup_lru_del(struct page *page) @@ -3629,7 +3627,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg, mz = mem_cgroup_zoneinfo(memcg, node, zid); list = &mz->lruvec.lists[lru]; - loop = MEM_CGROUP_ZSTAT(mz, lru); + loop = mz->lru_size[lru]; /* give some margin against EBUSY etc...*/ loop += 256; busy = NULL; -- cgit v0.10.2 From f156ab9333c7810f8c4b1a0413142f52534b2df1 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:19 -0700 Subject: memcg: enum lru_list lru Mostly we use "enum lru_list lru": change those few "l"s to "lru"s. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Acked-by: Kirill A. Shutemov Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6405e78..7572a50 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -719,14 +719,14 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid, unsigned int lru_mask) { struct mem_cgroup_per_zone *mz; - enum lru_list l; + enum lru_list lru; unsigned long ret = 0; mz = mem_cgroup_zoneinfo(memcg, nid, zid); - for_each_lru(l) { - if (BIT(l) & lru_mask) - ret += mz->lru_size[l]; + for_each_lru(lru) { + if (BIT(lru) & lru_mask) + ret += mz->lru_size[lru]; } return ret; } @@ -3701,10 +3701,10 @@ move_account: mem_cgroup_start_move(memcg); for_each_node_state(node, N_HIGH_MEMORY) { for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) { - enum lru_list l; - for_each_lru(l) { + enum lru_list lru; + for_each_lru(lru) { ret = mem_cgroup_force_empty_list(memcg, - node, zid, l); + node, zid, lru); if (ret) break; } @@ -4734,7 +4734,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) { struct mem_cgroup_per_node *pn; struct mem_cgroup_per_zone *mz; - enum lru_list l; + enum lru_list lru; int zone, tmp = node; /* * This routine is called against possible nodes. @@ -4752,8 +4752,8 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node) for (zone = 0; zone < MAX_NR_ZONES; zone++) { mz = &pn->zoneinfo[zone]; - for_each_lru(l) - INIT_LIST_HEAD(&mz->lruvec.lists[l]); + for_each_lru(lru) + INIT_LIST_HEAD(&mz->lruvec.lists[lru]); mz->usage_in_excess = 0; mz->on_tree = false; mz->memcg = memcg; -- cgit v0.10.2 From 1f2b71f41ee81735c25ef326da9a0610d640abc2 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:19 -0700 Subject: memcg: remove redundant returns Remove redundant returns from ends of functions, and one blank line. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Acked-by: Kirill A. Shutemov Acked-by: Michal Hocko Acked-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7572a50..43a9ade 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1391,7 +1391,6 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) if (!memcg || !p) return; - rcu_read_lock(); mem_cgrp = memcg->css.cgroup; @@ -1926,7 +1925,6 @@ out: if (unlikely(need_unlock)) move_unlock_page_cgroup(pc, &flags); rcu_read_unlock(); - return; } EXPORT_SYMBOL(mem_cgroup_update_page_stat); @@ -2912,7 +2910,6 @@ direct_uncharge: res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE); if (unlikely(batch->memcg != memcg)) memcg_oom_recover(memcg); - return; } /* @@ -3937,7 +3934,6 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg, out: *mem_limit = min_limit; *memsw_limit = min_memsw_limit; - return; } static int mem_cgroup_reset(struct cgroup *cont, unsigned int event) -- cgit v0.10.2 From 0e79dedde951e981612ed4e6d74873d61d2a113b Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:20 -0700 Subject: memcg: remove unnecessary thp check in page stat accounting Commit e94c8a9cbce1 ("memcg: make mem_cgroup_split_huge_fixup() more efficient") removed move_lock_page_cgroup(). So we do not have to check PageTransHuge in mem_cgroup_update_page_stat() and fallback into the locked accounting because both move_account() and thp split are done with compound_lock so they cannot race. The race between update vs. move is protected by mem_cgroup_stealed. PageTransHuge pages shouldn't appear in this code path currently because we are tracking only file pages at the moment but later we are planning to track also other pages (e.g. mlocked ones). Signed-off-by: KAMEZAWA Hiroyuki Cc: Johannes Weiner Cc: Andrea Arcangeli Reviewed-by: Acked-by: Michal Hocko Cc: David Rientjes Acked-by: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 43a9ade..69af5d5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1898,7 +1898,7 @@ void mem_cgroup_update_page_stat(struct page *page, if (unlikely(!memcg || !PageCgroupUsed(pc))) goto out; /* pc->mem_cgroup is unstable ? */ - if (unlikely(mem_cgroup_stealed(memcg)) || PageTransHuge(page)) { + if (unlikely(mem_cgroup_stealed(memcg))) { /* take a lock against to access pc->mem_cgroup */ move_lock_page_cgroup(pc, &flags); need_unlock = true; -- cgit v0.10.2 From 9f7de8275b46d9d11b1505adbfe6c2bb48df4741 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:20 -0700 Subject: idr: make idr_get_next() good for rcu_read_lock() Make one small adjustment to idr_get_next(): take the height from the top layer (stable under RCU) instead of from the root (unprotected by RCU), as idr_find() does: so that it can be used with RCU locking. Copied comment on RCU locking from idr_find(). Signed-off-by: Hugh Dickins Acked-by: KAMEZAWA Hiroyuki Acked-by: Li Zefan Cc: Eric Dumazet Acked-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/idr.c b/lib/idr.c index ed055b2..12499ba 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -595,8 +595,10 @@ EXPORT_SYMBOL(idr_for_each); * Returns pointer to registered object with id, which is next number to * given id. After being looked up, *@nextidp will be updated for the next * iteration. + * + * This function can be called under rcu_read_lock(), given that the leaf + * pointers lifetimes are correctly managed. */ - void *idr_get_next(struct idr *idp, int *nextidp) { struct idr_layer *p, *pa[MAX_LEVEL]; @@ -605,11 +607,11 @@ void *idr_get_next(struct idr *idp, int *nextidp) int n, max; /* find first ent */ - n = idp->layers * IDR_BITS; - max = 1 << n; p = rcu_dereference_raw(idp->top); if (!p) return NULL; + n = (p->layer + 1) * IDR_BITS; + max = 1 << n; while (id < max) { while (n > 0 && p) { -- cgit v0.10.2 From 42aee6c495e07dba7410b863a360db6bb9ec6d66 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:21 -0700 Subject: cgroup: revert ss_id_lock to spinlock Commit c1e2ee2dc436 ("memcg: replace ss->id_lock with a rwlock") has now been seen to cause the unfair behavior we should have expected from converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose get_new_cssid() is waiting for the wlock, while there are 19 tasks using the rlock in css_get_next() to get on with their memcg workload (in an artificial test, admittedly). Yet lib/idr.c was made suitable for RCU way back: revert that commit, restoring ss->id_lock to a spinlock. Signed-off-by: Hugh Dickins Acked-by: KAMEZAWA Hiroyuki Acked-by: Li Zefan Cc: Eric Dumazet Acked-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 501adb1..5a85b34 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -498,7 +498,7 @@ struct cgroup_subsys { struct list_head sibling; /* used when use_id == true */ struct idr idr; - rwlock_t id_lock; + spinlock_t id_lock; /* should be defined only by modular subsystems */ struct module *module; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index c6877fe..8eb90f2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4885,9 +4885,9 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css) rcu_assign_pointer(id->css, NULL); rcu_assign_pointer(css->id, NULL); - write_lock(&ss->id_lock); + spin_lock(&ss->id_lock); idr_remove(&ss->idr, id->id); - write_unlock(&ss->id_lock); + spin_unlock(&ss->id_lock); kfree_rcu(id, rcu_head); } EXPORT_SYMBOL_GPL(free_css_id); @@ -4913,10 +4913,10 @@ static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth) error = -ENOMEM; goto err_out; } - write_lock(&ss->id_lock); + spin_lock(&ss->id_lock); /* Don't use 0. allocates an ID of 1-65535 */ error = idr_get_new_above(&ss->idr, newid, 1, &myid); - write_unlock(&ss->id_lock); + spin_unlock(&ss->id_lock); /* Returns error when there are no free spaces for new ID.*/ if (error) { @@ -4931,9 +4931,9 @@ static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth) return newid; remove_idr: error = -ENOSPC; - write_lock(&ss->id_lock); + spin_lock(&ss->id_lock); idr_remove(&ss->idr, myid); - write_unlock(&ss->id_lock); + spin_unlock(&ss->id_lock); err_out: kfree(newid); return ERR_PTR(error); @@ -4945,7 +4945,7 @@ static int __init_or_module cgroup_init_idr(struct cgroup_subsys *ss, { struct css_id *newid; - rwlock_init(&ss->id_lock); + spin_lock_init(&ss->id_lock); idr_init(&ss->idr); newid = get_new_cssid(ss, 0); @@ -5040,9 +5040,9 @@ css_get_next(struct cgroup_subsys *ss, int id, * scan next entry from bitmap(tree), tmpid is updated after * idr_get_next(). */ - read_lock(&ss->id_lock); + spin_lock(&ss->id_lock); tmp = idr_get_next(&ss->idr, &tmpid); - read_unlock(&ss->id_lock); + spin_unlock(&ss->id_lock); if (!tmp) break; -- cgit v0.10.2 From ca464d69b19120a826aa2534de2511a6f542edf5 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 21 Mar 2012 16:34:21 -0700 Subject: memcg: let css_get_next() rely upon rcu_read_lock() Remove lock and unlock around css_get_next()'s call to idr_get_next(). memcg iterators (only users of css_get_next) already did rcu_read_lock(), and its comment demands that; but add a WARN_ON_ONCE to make sure of it. Signed-off-by: Hugh Dickins Acked-by: KAMEZAWA Hiroyuki Acked-by: Li Zefan Cc: Eric Dumazet Acked-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 8eb90f2..391d5e9 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5033,6 +5033,8 @@ css_get_next(struct cgroup_subsys *ss, int id, return NULL; BUG_ON(!ss->use_id); + WARN_ON_ONCE(!rcu_read_lock_held()); + /* fill start point for scan */ tmpid = id; while (1) { @@ -5040,10 +5042,7 @@ css_get_next(struct cgroup_subsys *ss, int id, * scan next entry from bitmap(tree), tmpid is updated after * idr_get_next(). */ - spin_lock(&ss->id_lock); tmp = idr_get_next(&ss->idr, &tmpid); - spin_unlock(&ss->id_lock); - if (!tmp) break; if (tmp->depth >= depth && tmp->stack[depth] == rootid) { -- cgit v0.10.2 From b24028572fb69e9dd6de8c359eba2b2c66baa889 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:22 -0700 Subject: memcg: remove PCG_CACHE page_cgroup flag We record 'the page is cache' with the PCG_CACHE bit in page_cgroup. Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't include shmem. Considering callers, at charge/uncharge, the caller should know what the page is and we don't need to record it by using one bit per page. This patch removes PCG_CACHE bit and make callers of mem_cgroup_charge_statistics() to specify what the page is. About page migration: Mapping of the used page is not touched during migra tion (see page_remove_rmap) so we can rely on it and push the correct charge type down to __mem_cgroup_uncharge_common from end_migration for unused page. The force flag was misleading was abused for skipping the needless page_mapped() / PageCgroupMigration() check, as we know the unused page is no longer mapped and cleared the migration flag just a few lines up. But doing the checks is no biggie and it's not worth adding another flag just to skip them. [akpm@linux-foundation.org: checkpatch fixes] [hughd@google.com: fix PageAnon uncharging] Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: Hugh Dickins Cc: Ying Han Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index a2d11771..1060292 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -4,7 +4,6 @@ enum { /* flags for mem_cgroup */ PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */ - PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ PCG_MIGRATION, /* under page migration */ /* flags for mem_cgroup and file and I/O status */ @@ -64,11 +63,6 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \ static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \ { return test_and_clear_bit(PCG_##lname, &pc->flags); } -/* Cache flag is set only once (at allocation) */ -TESTPCGFLAG(Cache, CACHE) -CLEARPCGFLAG(Cache, CACHE) -SETPCGFLAG(Cache, CACHE) - TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) SETPCGFLAG(Used, USED) @@ -85,7 +79,7 @@ static inline void lock_page_cgroup(struct page_cgroup *pc) { /* * Don't take this lock in IRQ context. - * This lock is for pc->mem_cgroup, USED, CACHE, MIGRATION + * This lock is for pc->mem_cgroup, USED, MIGRATION */ bit_spin_lock(PCG_LOCK, &pc->flags); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 69af5d5..88113ee 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -690,15 +690,19 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, } static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, - bool file, int nr_pages) + bool anon, int nr_pages) { preempt_disable(); - if (file) - __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE], + /* + * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is + * counted as CACHE even if it's on ANON LRU. + */ + if (anon) + __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_pages); else - __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS], + __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_pages); /* pagein of a big page is an event. So, ignore page size */ @@ -2442,6 +2446,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, { struct zone *uninitialized_var(zone); bool was_on_lru = false; + bool anon; lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { @@ -2477,19 +2482,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, * See mem_cgroup_add_lru_list(), etc. */ smp_wmb(); - switch (ctype) { - case MEM_CGROUP_CHARGE_TYPE_CACHE: - case MEM_CGROUP_CHARGE_TYPE_SHMEM: - SetPageCgroupCache(pc); - SetPageCgroupUsed(pc); - break; - case MEM_CGROUP_CHARGE_TYPE_MAPPED: - ClearPageCgroupCache(pc); - SetPageCgroupUsed(pc); - break; - default: - break; - } + SetPageCgroupUsed(pc); if (lrucare) { if (was_on_lru) { @@ -2500,7 +2493,12 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, spin_unlock_irq(&zone->lru_lock); } - mem_cgroup_charge_statistics(memcg, PageCgroupCache(pc), nr_pages); + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) + anon = true; + else + anon = false; + + mem_cgroup_charge_statistics(memcg, anon, nr_pages); unlock_page_cgroup(pc); /* @@ -2565,6 +2563,7 @@ static int mem_cgroup_move_account(struct page *page, { unsigned long flags; int ret; + bool anon = PageAnon(page); VM_BUG_ON(from == to); VM_BUG_ON(PageLRU(page)); @@ -2593,14 +2592,14 @@ static int mem_cgroup_move_account(struct page *page, __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); preempt_enable(); } - mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages); + mem_cgroup_charge_statistics(from, anon, -nr_pages); if (uncharge) /* This is not "cancel", but cancel_charge does all we need. */ __mem_cgroup_cancel_charge(from, nr_pages); /* caller should have done css_get */ pc->mem_cgroup = to; - mem_cgroup_charge_statistics(to, PageCgroupCache(pc), nr_pages); + mem_cgroup_charge_statistics(to, anon, nr_pages); /* * We charges against "to" which may not have any tasks. Then, "to" * can be under rmdir(). But in current implementation, caller of @@ -2921,6 +2920,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) struct mem_cgroup *memcg = NULL; unsigned int nr_pages = 1; struct page_cgroup *pc; + bool anon; if (mem_cgroup_disabled()) return NULL; @@ -2946,8 +2946,12 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) if (!PageCgroupUsed(pc)) goto unlock_out; + anon = PageAnon(page); + switch (ctype) { case MEM_CGROUP_CHARGE_TYPE_MAPPED: + anon = true; + /* fallthrough */ case MEM_CGROUP_CHARGE_TYPE_DROP: /* See mem_cgroup_prepare_migration() */ if (page_mapped(page) || PageCgroupMigration(pc)) @@ -2964,7 +2968,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) break; } - mem_cgroup_charge_statistics(memcg, PageCgroupCache(pc), -nr_pages); + mem_cgroup_charge_statistics(memcg, anon, -nr_pages); ClearPageCgroupUsed(pc); /* @@ -3271,6 +3275,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg, { struct page *used, *unused; struct page_cgroup *pc; + bool anon; if (!memcg) return; @@ -3292,8 +3297,10 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg, lock_page_cgroup(pc); ClearPageCgroupMigration(pc); unlock_page_cgroup(pc); - - __mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE); + anon = PageAnon(used); + __mem_cgroup_uncharge_common(unused, + anon ? MEM_CGROUP_CHARGE_TYPE_MAPPED + : MEM_CGROUP_CHARGE_TYPE_CACHE); /* * If a page is a file cache, radix-tree replacement is very atomic @@ -3303,7 +3310,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg, * and USED bit check in mem_cgroup_uncharge_page() will do enough * check. (see prepare_charge() also) */ - if (PageAnon(used)) + if (anon) mem_cgroup_uncharge_page(used); /* * At migration, we may charge account against cgroup which has no @@ -3333,7 +3340,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage, /* fix accounting on old pages */ lock_page_cgroup(pc); memcg = pc->mem_cgroup; - mem_cgroup_charge_statistics(memcg, PageCgroupCache(pc), -1); + mem_cgroup_charge_statistics(memcg, false, -1); ClearPageCgroupUsed(pc); unlock_page_cgroup(pc); -- cgit v0.10.2 From a710920caedfcf56543136bfea300a6c593f9838 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 21 Mar 2012 16:34:22 -0700 Subject: memcg: kill dead prev_priority stubs This code was removed in 25edde033291 ("vmscan: kill prev_priority completely") Signed-off-by: Konstantin Khlebnikov Acked-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e76f107..c54e5df 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -299,21 +299,6 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root, { } -static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg) -{ - return 0; -} - -static inline void mem_cgroup_note_reclaim_priority(struct mem_cgroup *memcg, - int priority) -{ -} - -static inline void mem_cgroup_record_reclaim_priority(struct mem_cgroup *memcg, - int priority) -{ -} - static inline bool mem_cgroup_disabled(void) { return true; -- cgit v0.10.2 From 9e3357907c84517d9e07bc0b19265807f0264b43 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:23 -0700 Subject: memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) As described in the log, I guess EXPORT was for preparing dirty accounting. But _now_, we don't need to export this. Remove this for now. Signed-off-by: KAMEZAWA Hiroyuki Reviewed-by: Greg Thelen Cc: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 88113ee..eba04a4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1930,7 +1930,6 @@ out: move_unlock_page_cgroup(pc, &flags); rcu_read_unlock(); } -EXPORT_SYMBOL(mem_cgroup_update_page_stat); /* * size of first charge trial. "32" comes from vmscan.c's magic value. -- cgit v0.10.2 From 619d094b5872a5af153f1af77a8b7f7326faf0d0 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:23 -0700 Subject: memcg: simplify move_account() check In memcg, for avoiding take-lock-irq-off at accessing page_cgroup, a logic, flag + rcu_read_lock(), is used. This works as following CPU-A CPU-B rcu_read_lock() set flag if(flag is set) take heavy lock do job. synchronize_rcu() rcu_read_unlock() take heavy lock. In recent discussion, it's argued that using per-cpu value for this flag just complicates the code because 'set flag' is very rare. This patch changes 'flag' implementation from percpu to atomic_t. This will be much simpler. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Greg Thelen Cc: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Cc: "Paul E. McKenney" Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eba04a4..cfd2db0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -89,7 +89,6 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */ - MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */ MEM_CGROUP_STAT_NSTATS, }; @@ -298,6 +297,10 @@ struct mem_cgroup { */ unsigned long move_charge_at_immigrate; /* + * set > 0 if pages under this cgroup are moving to other cgroup. + */ + atomic_t moving_account; + /* * percpu counter. */ struct mem_cgroup_stat_cpu *stat; @@ -1287,35 +1290,36 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg) return memcg->swappiness; } +/* + * memcg->moving_account is used for checking possibility that some thread is + * calling move_account(). When a thread on CPU-A starts moving pages under + * a memcg, other threads should check memcg->moving_account under + * rcu_read_lock(), like this: + * + * CPU-A CPU-B + * rcu_read_lock() + * memcg->moving_account+1 if (memcg->mocing_account) + * take heavy locks. + * synchronize_rcu() update something. + * rcu_read_unlock() + * start move here. + */ static void mem_cgroup_start_move(struct mem_cgroup *memcg) { - int cpu; - - get_online_cpus(); - spin_lock(&memcg->pcp_counter_lock); - for_each_online_cpu(cpu) - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1; - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1; - spin_unlock(&memcg->pcp_counter_lock); - put_online_cpus(); - + atomic_inc(&memcg->moving_account); synchronize_rcu(); } static void mem_cgroup_end_move(struct mem_cgroup *memcg) { - int cpu; - - if (!memcg) - return; - get_online_cpus(); - spin_lock(&memcg->pcp_counter_lock); - for_each_online_cpu(cpu) - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1; - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] -= 1; - spin_unlock(&memcg->pcp_counter_lock); - put_online_cpus(); + /* + * Now, mem_cgroup_clear_mc() may call this function with NULL. + * We check NULL in callee rather than caller. + */ + if (memcg) + atomic_dec(&memcg->moving_account); } + /* * 2 routines for checking "mem" is under move_account() or not. * @@ -1331,7 +1335,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg) static bool mem_cgroup_stealed(struct mem_cgroup *memcg) { VM_BUG_ON(!rcu_read_lock_held()); - return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0; + return atomic_read(&memcg->moving_account) > 0; } static bool mem_cgroup_under_move(struct mem_cgroup *memcg) @@ -1882,8 +1886,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, int order) * by flags. * * Considering "move", this is an only case we see a race. To make the race - * small, we check MEM_CGROUP_ON_MOVE percpu value and detect there are - * possibility of race condition. If there is, we take a lock. + * small, we check mm->moving_account and detect there are possibility of race + * If there is, we take a lock. */ void mem_cgroup_update_page_stat(struct page *page, @@ -2100,17 +2104,6 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu) per_cpu(memcg->stat->events[i], cpu) = 0; memcg->nocpu_base.events[i] += x; } - /* need to clear ON_MOVE value, works as a kind of lock. */ - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0; - spin_unlock(&memcg->pcp_counter_lock); -} - -static void synchronize_mem_cgroup_on_move(struct mem_cgroup *memcg, int cpu) -{ - int idx = MEM_CGROUP_ON_MOVE; - - spin_lock(&memcg->pcp_counter_lock); - per_cpu(memcg->stat->count[idx], cpu) = memcg->nocpu_base.count[idx]; spin_unlock(&memcg->pcp_counter_lock); } @@ -2122,11 +2115,8 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb, struct memcg_stock_pcp *stock; struct mem_cgroup *iter; - if ((action == CPU_ONLINE)) { - for_each_mem_cgroup(iter) - synchronize_mem_cgroup_on_move(iter, cpu); + if (action == CPU_ONLINE) return NOTIFY_OK; - } if ((action != CPU_DEAD) || action != CPU_DEAD_FROZEN) return NOTIFY_OK; -- cgit v0.10.2 From 312734c04e2fecc58429aec98194e4ff12d8f7d6 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:24 -0700 Subject: memcg: remove PCG_MOVE_LOCK flag from page_cgroup PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting pc->mem_cgroup and page statistics accounting per memcg. This lock helps to avoid the race but the race is very rare because moving tasks between cgroup is not a usual job. So, it seems using 1bit per page is too costly. This patch changes this lock as per-memcg spinlock and removes PCG_MOVE_LOCK. If smaller lock is required, we'll be able to add some hashes but I'd like to start from this. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Greg Thelen Cc: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 1060292..7a3af74 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -7,7 +7,6 @@ enum { PCG_USED, /* this object is in use. */ PCG_MIGRATION, /* under page migration */ /* flags for mem_cgroup and file and I/O status */ - PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */ PCG_FILE_MAPPED, /* page is accounted as "mapped" */ __NR_PCG_FLAGS, }; @@ -89,24 +88,6 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc) bit_spin_unlock(PCG_LOCK, &pc->flags); } -static inline void move_lock_page_cgroup(struct page_cgroup *pc, - unsigned long *flags) -{ - /* - * We know updates to pc->flags of page cache's stats are from both of - * usual context or IRQ context. Disable IRQ to avoid deadlock. - */ - local_irq_save(*flags); - bit_spin_lock(PCG_MOVE_LOCK, &pc->flags); -} - -static inline void move_unlock_page_cgroup(struct page_cgroup *pc, - unsigned long *flags) -{ - bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags); - local_irq_restore(*flags); -} - #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cfd2db0..8afed28 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -300,6 +300,8 @@ struct mem_cgroup { * set > 0 if pages under this cgroup are moving to other cgroup. */ atomic_t moving_account; + /* taken only while moving_account > 0 */ + spinlock_t move_lock; /* * percpu counter. */ @@ -1376,6 +1378,24 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg) return false; } +/* + * Take this lock when + * - a code tries to modify page's memcg while it's USED. + * - a code tries to modify page state accounting in a memcg. + * see mem_cgroup_stealed(), too. + */ +static void move_lock_mem_cgroup(struct mem_cgroup *memcg, + unsigned long *flags) +{ + spin_lock_irqsave(&memcg->move_lock, *flags); +} + +static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, + unsigned long *flags) +{ + spin_unlock_irqrestore(&memcg->move_lock, *flags); +} + /** * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. * @memcg: The memory cgroup that went over limit @@ -1900,7 +1920,7 @@ void mem_cgroup_update_page_stat(struct page *page, if (mem_cgroup_disabled()) return; - +again: rcu_read_lock(); memcg = pc->mem_cgroup; if (unlikely(!memcg || !PageCgroupUsed(pc))) @@ -1908,11 +1928,13 @@ void mem_cgroup_update_page_stat(struct page *page, /* pc->mem_cgroup is unstable ? */ if (unlikely(mem_cgroup_stealed(memcg))) { /* take a lock against to access pc->mem_cgroup */ - move_lock_page_cgroup(pc, &flags); + move_lock_mem_cgroup(memcg, &flags); + if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { + move_unlock_mem_cgroup(memcg, &flags); + rcu_read_unlock(); + goto again; + } need_unlock = true; - memcg = pc->mem_cgroup; - if (!memcg || !PageCgroupUsed(pc)) - goto out; } switch (idx) { @@ -1931,7 +1953,7 @@ void mem_cgroup_update_page_stat(struct page *page, out: if (unlikely(need_unlock)) - move_unlock_page_cgroup(pc, &flags); + move_unlock_mem_cgroup(memcg, &flags); rcu_read_unlock(); } @@ -2500,8 +2522,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg, #ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MOVE_LOCK) |\ - (1 << PCG_MIGRATION)) +#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MIGRATION)) /* * Because tail pages are not marked as "used", set it. We're under * zone->lru_lock, 'splitting on pmd' and compound_lock. @@ -2572,7 +2593,7 @@ static int mem_cgroup_move_account(struct page *page, if (!PageCgroupUsed(pc) || pc->mem_cgroup != from) goto unlock; - move_lock_page_cgroup(pc, &flags); + move_lock_mem_cgroup(from, &flags); if (PageCgroupFileMapped(pc)) { /* Update mapped_file data for mem_cgroup */ @@ -2596,7 +2617,7 @@ static int mem_cgroup_move_account(struct page *page, * guaranteed that "to" is never removed. So, we don't check rmdir * status here. */ - move_unlock_page_cgroup(pc, &flags); + move_unlock_mem_cgroup(from, &flags); ret = 0; unlock: unlock_page_cgroup(pc); @@ -4971,6 +4992,7 @@ mem_cgroup_create(struct cgroup *cont) atomic_set(&memcg->refcnt, 1); memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); + spin_lock_init(&memcg->move_lock); return &memcg->css; free_out: __mem_cgroup_free(memcg); -- cgit v0.10.2 From 89c06bd52fb9ffceddf84f7309d2e8c9f1666216 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:25 -0700 Subject: memcg: use new logic for page stat accounting Now, page-stat-per-memcg is recorded into per page_cgroup flag by duplicating page's status into the flag. The reason is that memcg has a feature to move a page from a group to another group and we have race between "move" and "page stat accounting", Under current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B does "page stat accounting". When CPU-A goes 1st, CPU-A CPU-B update "struct page" info. move_lock_mem_cgroup(memcg) see pc->flags copy page stat to new group overwrite pc->mem_cgroup. move_unlock_mem_cgroup(memcg) move_lock_mem_cgroup(mem) set pc->flags update page stat accounting move_unlock_mem_cgroup(mem) stat accounting is guarded by move_lock_mem_cgroup() and "move" logic (CPU-A) doesn't see changes in "struct page" information. But it's costly to have the same information both in 'struct page' and 'struct page_cgroup'. And, there is a potential problem. For example, assume we have PG_dirty accounting in memcg. PG_..is a flag for struct page. PCG_ is a flag for struct page_cgroup. (This is just an example. The same problem can be found in any kind of page stat accounting.) CPU-A CPU-B TestSet PG_dirty (delay) TestClear PG_dirty if (TestClear(PCG_dirty)) memcg->nr_dirty-- if (TestSet(PCG_dirty)) memcg->nr_dirty++ Here, memcg->nr_dirty = +1, this is wrong. This race was reported by Greg Thelen . Now, only FILE_MAPPED is supported but fortunately, it's serialized by page table lock and this is not real bug, _now_, If this potential problem is caused by having duplicated information in struct page and struct page_cgroup, we may be able to fix this by using original 'struct page' information. But we'll have a problem in "move account" Assume we use only PG_dirty. CPU-A CPU-B TestSet PG_dirty (delay) move_lock_mem_cgroup() if (PageDirty(page)) new_memcg->nr_dirty++ pc->mem_cgroup = new_memcg; move_unlock_mem_cgroup() move_lock_mem_cgroup() memcg = pc->mem_cgroup new_memcg->nr_dirty++ accounting information may be double-counted. This was original reason to have PCG_xxx flags but it seems PCG_xxx has another problem. I think we need a bigger lock as move_lock_mem_cgroup(page) TestSetPageDirty(page) update page stats (without any checks) move_unlock_mem_cgroup(page) This fixes both of problems and we don't have to duplicate page flag into page_cgroup. Please note: move_lock_mem_cgroup() is held only when there are possibility of "account move" under the system. So, in most path, status update will go without atomic locks. This patch introduces mem_cgroup_begin_update_page_stat() and mem_cgroup_end_update_page_stat() both should be called at modifying 'struct page' information if memcg takes care of it. as mem_cgroup_begin_update_page_stat() modify page information mem_cgroup_update_page_stat() => never check any 'struct page' info, just update counters. mem_cgroup_end_update_page_stat(). This patch is slow because we need to call begin_update_page_stat()/ end_update_page_stat() regardless of accounted will be changed or not. A following patch adds an easy optimization and reduces the cost. [akpm@linux-foundation.org: s/lock/locked/] [hughd@google.com: fix deadlock by avoiding stat lock when anon] Signed-off-by: KAMEZAWA Hiroyuki Cc: Greg Thelen Acked-by: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c54e5df..bf7ae01 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void) return false; } +void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, + unsigned long *flags); + +static inline void mem_cgroup_begin_update_page_stat(struct page *page, + bool *locked, unsigned long *flags) +{ + if (mem_cgroup_disabled()) + return; + rcu_read_lock(); + *locked = false; + return __mem_cgroup_begin_update_page_stat(page, locked, flags); +} + +void __mem_cgroup_end_update_page_stat(struct page *page, + unsigned long *flags); +static inline void mem_cgroup_end_update_page_stat(struct page *page, + bool *locked, unsigned long *flags) +{ + if (mem_cgroup_disabled()) + return; + if (*locked) + __mem_cgroup_end_update_page_stat(page, flags); + rcu_read_unlock(); +} + void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx, int val); @@ -341,6 +366,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } +static inline void mem_cgroup_begin_update_page_stat(struct page *page, + bool *locked, unsigned long *flags) +{ +} + +static inline void mem_cgroup_end_update_page_stat(struct page *page, + bool *locked, unsigned long *flags) +{ +} + static inline void mem_cgroup_inc_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8afed28..df1e180 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1910,32 +1910,59 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, int order) * If there is, we take a lock. */ +void __mem_cgroup_begin_update_page_stat(struct page *page, + bool *locked, unsigned long *flags) +{ + struct mem_cgroup *memcg; + struct page_cgroup *pc; + + pc = lookup_page_cgroup(page); +again: + memcg = pc->mem_cgroup; + if (unlikely(!memcg || !PageCgroupUsed(pc))) + return; + /* + * If this memory cgroup is not under account moving, we don't + * need to take move_lock_page_cgroup(). Because we already hold + * rcu_read_lock(), any calls to move_account will be delayed until + * rcu_read_unlock() if mem_cgroup_stealed() == true. + */ + if (!mem_cgroup_stealed(memcg)) + return; + + move_lock_mem_cgroup(memcg, flags); + if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { + move_unlock_mem_cgroup(memcg, flags); + goto again; + } + *locked = true; +} + +void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags) +{ + struct page_cgroup *pc = lookup_page_cgroup(page); + + /* + * It's guaranteed that pc->mem_cgroup never changes while + * lock is held because a routine modifies pc->mem_cgroup + * should take move_lock_page_cgroup(). + */ + move_unlock_mem_cgroup(pc->mem_cgroup, flags); +} + void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx, int val) { struct mem_cgroup *memcg; struct page_cgroup *pc = lookup_page_cgroup(page); - bool need_unlock = false; unsigned long uninitialized_var(flags); if (mem_cgroup_disabled()) return; -again: - rcu_read_lock(); + memcg = pc->mem_cgroup; if (unlikely(!memcg || !PageCgroupUsed(pc))) - goto out; - /* pc->mem_cgroup is unstable ? */ - if (unlikely(mem_cgroup_stealed(memcg))) { - /* take a lock against to access pc->mem_cgroup */ - move_lock_mem_cgroup(memcg, &flags); - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { - move_unlock_mem_cgroup(memcg, &flags); - rcu_read_unlock(); - goto again; - } - need_unlock = true; - } + return; switch (idx) { case MEMCG_NR_FILE_MAPPED: @@ -1950,11 +1977,6 @@ again: } this_cpu_add(memcg->stat->count[idx], val); - -out: - if (unlikely(need_unlock)) - move_unlock_mem_cgroup(memcg, &flags); - rcu_read_unlock(); } /* diff --git a/mm/rmap.c b/mm/rmap.c index ebeb95e..5b5ad58 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1148,10 +1148,15 @@ void page_add_new_anon_rmap(struct page *page, */ void page_add_file_rmap(struct page *page) { + bool locked; + unsigned long flags; + + mem_cgroup_begin_update_page_stat(page, &locked, &flags); if (atomic_inc_and_test(&page->_mapcount)) { __inc_zone_page_state(page, NR_FILE_MAPPED); mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED); } + mem_cgroup_end_update_page_stat(page, &locked, &flags); } /** @@ -1162,9 +1167,21 @@ void page_add_file_rmap(struct page *page) */ void page_remove_rmap(struct page *page) { + bool anon = PageAnon(page); + bool locked; + unsigned long flags; + + /* + * The anon case has no mem_cgroup page_stat to update; but may + * uncharge_page() below, where the lock ordering can deadlock if + * we hold the lock against page_stat move: so avoid it on anon. + */ + if (!anon) + mem_cgroup_begin_update_page_stat(page, &locked, &flags); + /* page still mapped by someone else? */ if (!atomic_add_negative(-1, &page->_mapcount)) - return; + goto out; /* * Now that the last pte has gone, s390 must transfer dirty @@ -1173,7 +1190,7 @@ void page_remove_rmap(struct page *page) * not if it's in swapcache - there might be another pte slot * containing the swap entry, but page not yet written to swap. */ - if ((!PageAnon(page) || PageSwapCache(page)) && + if ((!anon || PageSwapCache(page)) && page_test_and_clear_dirty(page_to_pfn(page), 1)) set_page_dirty(page); /* @@ -1181,8 +1198,8 @@ void page_remove_rmap(struct page *page) * and not charged by memcg for now. */ if (unlikely(PageHuge(page))) - return; - if (PageAnon(page)) { + goto out; + if (anon) { mem_cgroup_uncharge_page(page); if (!PageTransHuge(page)) __dec_zone_page_state(page, NR_ANON_PAGES); @@ -1202,6 +1219,9 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ +out: + if (!anon) + mem_cgroup_end_update_page_stat(page, &locked, &flags); } /* -- cgit v0.10.2 From 2ff76f1193f8481f7e6c29304eea4006e8e51569 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:25 -0700 Subject: memcg: remove PCG_FILE_MAPPED With the new lock scheme for updating memcg's page stat, we don't need a flag PCG_FILE_MAPPED which was duplicated information of page_mapped(). [hughd@google.com: cosmetic fix] [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()] Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Greg Thelen Acked-by: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 7a3af74..a88cdba 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -6,8 +6,6 @@ enum { PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */ PCG_USED, /* this object is in use. */ PCG_MIGRATION, /* under page migration */ - /* flags for mem_cgroup and file and I/O status */ - PCG_FILE_MAPPED, /* page is accounted as "mapped" */ __NR_PCG_FLAGS, }; @@ -66,10 +64,6 @@ TESTPCGFLAG(Used, USED) CLEARPCGFLAG(Used, USED) SETPCGFLAG(Used, USED) -SETPCGFLAG(FileMapped, FILE_MAPPED) -CLEARPCGFLAG(FileMapped, FILE_MAPPED) -TESTPCGFLAG(FileMapped, FILE_MAPPED) - SETPCGFLAG(Migration, MIGRATION) CLEARPCGFLAG(Migration, MIGRATION) TESTPCGFLAG(Migration, MIGRATION) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df1e180..0e13b2a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1966,10 +1966,6 @@ void mem_cgroup_update_page_stat(struct page *page, switch (idx) { case MEMCG_NR_FILE_MAPPED: - if (val > 0) - SetPageCgroupFileMapped(pc); - else if (!page_mapped(page)) - ClearPageCgroupFileMapped(pc); idx = MEM_CGROUP_STAT_FILE_MAPPED; break; default: @@ -2617,7 +2613,7 @@ static int mem_cgroup_move_account(struct page *page, move_lock_mem_cgroup(from, &flags); - if (PageCgroupFileMapped(pc)) { + if (!anon && page_mapped(page)) { /* Update mapped_file data for mem_cgroup */ preempt_disable(); __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); @@ -2982,6 +2978,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) switch (ctype) { case MEM_CGROUP_CHARGE_TYPE_MAPPED: + /* + * Generally PageAnon tells if it's the anon statistics to be + * updated; but sometimes e.g. mem_cgroup_uncharge_page() is + * used before page reached the stage of being marked PageAnon. + */ anon = true; /* fallthrough */ case MEM_CGROUP_CHARGE_TYPE_DROP: -- cgit v0.10.2 From 4331f7d339ee0b54603344b9d13662a9c022540c Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 21 Mar 2012 16:34:26 -0700 Subject: memcg: fix performance of mem_cgroup_begin_update_page_stat() mem_cgroup_begin_update_page_stat() should be very fast because it's called very frequently. Now, it needs to look up page_cgroup and its memcg....this is slow. This patch adds a global variable to check "any memcg is moving or not". With this, the caller doesn't need to visit page_cgroup and memcg. Here is a test result. A test program makes page faults onto a file, MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the range by madvise() and page fault again. This program causes 26214400 times of page fault onto a file(size was 1G.) and shows shows the cost of mem_cgroup_begin_update_page_stat(). Before this patch for mem_cgroup_begin_update_page_stat() [kamezawa@bluextal test]$ time ./mmap 1G real 0m21.765s user 0m5.999s sys 0m15.434s 27.46% mmap mmap [.] reader 21.15% mmap [kernel.kallsyms] [k] page_fault 9.17% mmap [kernel.kallsyms] [k] filemap_fault 2.96% mmap [kernel.kallsyms] [k] __do_fault 2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat After this patch [root@bluextal test]# time ./mmap 1G real 0m21.373s user 0m6.113s sys 0m15.016s In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away. Note: we may be able to remove this optimization in future if we can get pointer to memcg directly from struct page. [akpm@linux-foundation.org: don't return a void] Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Greg Thelen Acked-by: Johannes Weiner Cc: Michal Hocko Cc: KOSAKI Motohiro Cc: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index bf7ae01..f94efd2 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -144,6 +144,8 @@ static inline bool mem_cgroup_disabled(void) void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags); +extern atomic_t memcg_moving; + static inline void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags) { @@ -151,7 +153,8 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page, return; rcu_read_lock(); *locked = false; - return __mem_cgroup_begin_update_page_stat(page, locked, flags); + if (atomic_read(&memcg_moving)) + __mem_cgroup_begin_update_page_stat(page, locked, flags); } void __mem_cgroup_end_update_page_stat(struct page *page, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0e13b2a..eb1004f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1306,8 +1306,13 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg) * rcu_read_unlock() * start move here. */ + +/* for quick checking without looking up memcg */ +atomic_t memcg_moving __read_mostly; + static void mem_cgroup_start_move(struct mem_cgroup *memcg) { + atomic_inc(&memcg_moving); atomic_inc(&memcg->moving_account); synchronize_rcu(); } @@ -1318,8 +1323,10 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg) * Now, mem_cgroup_clear_mc() may call this function with NULL. * We check NULL in callee rather than caller. */ - if (memcg) + if (memcg) { + atomic_dec(&memcg_moving); atomic_dec(&memcg->moving_account); + } } /* -- cgit v0.10.2 From 13fd1dd9db345f6b2babd1e80a1c929092eb4896 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Wed, 21 Mar 2012 16:34:26 -0700 Subject: mm/memcontrol.c: s/stealed/stolen/ A grammatical fix. Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eb1004f..c200875 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1332,8 +1332,8 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg) /* * 2 routines for checking "mem" is under move_account() or not. * - * mem_cgroup_stealed() - checking a cgroup is mc.from or not. This is used - * for avoiding race in accounting. If true, + * mem_cgroup_stolen() - checking whether a cgroup is mc.from or not. This + * is used for avoiding races in accounting. If true, * pc->mem_cgroup may be overwritten. * * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or @@ -1341,7 +1341,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg) * waiting at hith-memory prressure caused by "move". */ -static bool mem_cgroup_stealed(struct mem_cgroup *memcg) +static bool mem_cgroup_stolen(struct mem_cgroup *memcg) { VM_BUG_ON(!rcu_read_lock_held()); return atomic_read(&memcg->moving_account) > 0; @@ -1389,7 +1389,7 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg) * Take this lock when * - a code tries to modify page's memcg while it's USED. * - a code tries to modify page state accounting in a memcg. - * see mem_cgroup_stealed(), too. + * see mem_cgroup_stolen(), too. */ static void move_lock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags) @@ -1932,9 +1932,9 @@ again: * If this memory cgroup is not under account moving, we don't * need to take move_lock_page_cgroup(). Because we already hold * rcu_read_lock(), any calls to move_account will be delayed until - * rcu_read_unlock() if mem_cgroup_stealed() == true. + * rcu_read_unlock() if mem_cgroup_stolen() == true. */ - if (!mem_cgroup_stealed(memcg)) + if (!mem_cgroup_stolen(memcg)) return; move_lock_mem_cgroup(memcg, flags); -- cgit v0.10.2 From 45f3e385b7a639c633d7a4b1e863c2d52b918258 Mon Sep 17 00:00:00 2001 From: Anton Vorontsov Date: Wed, 21 Mar 2012 16:34:26 -0700 Subject: mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() In the following code: if (type == _MEM) thresholds = &memcg->thresholds; else if (type == _MEMSWAP) thresholds = &memcg->memsw_thresholds; else BUG(); BUG_ON(!thresholds); The BUG_ON() seems redundant. Signed-off-by: Anton Vorontsov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c200875..4dc9709 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4467,12 +4467,6 @@ static void mem_cgroup_usage_unregister_event(struct cgroup *cgrp, else BUG(); - /* - * Something went wrong if we trying to unregister a threshold - * if we don't have thresholds - */ - BUG_ON(!thresholds); - if (!thresholds->primary) goto unlock; -- cgit v0.10.2 From a488428871265979bcf2c46298a04c1d5826e6cb Mon Sep 17 00:00:00 2001 From: Jeff Liu Date: Wed, 21 Mar 2012 16:34:27 -0700 Subject: mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() Signed-off-by: Jie Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4dc9709..6110293 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3902,7 +3902,6 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft) break; default: BUG(); - break; } return val; } -- cgit v0.10.2 From 8d32ff84401f1addb961c7af2c8d9baceb0ab9ba Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:34:27 -0700 Subject: memcg: clean up existing move charge code - Replace lengthy function name is_target_pte_for_mc() with a shorter one in order to avoid ugly line breaks. - explicitly use MC_TARGET_* instead of simply using integers. Signed-off-by: Naoya Horiguchi Cc: Andrea Arcangeli Cc: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura Cc: Hillf Danton Cc: David Rientjes Acked-by: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6110293..c8d00a9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5110,7 +5110,7 @@ one_by_one: } /** - * is_target_pte_for_mc - check a pte whether it is valid for move charge + * get_mctgt_type - get target type of moving charge * @vma: the vma the pte to be checked belongs * @addr: the address corresponding to the pte to be checked * @ptent: the pte to be checked @@ -5133,7 +5133,7 @@ union mc_target { }; enum mc_target_type { - MC_TARGET_NONE, /* not used */ + MC_TARGET_NONE = 0, MC_TARGET_PAGE, MC_TARGET_SWAP, }; @@ -5214,12 +5214,12 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, return page; } -static int is_target_pte_for_mc(struct vm_area_struct *vma, +static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, unsigned long addr, pte_t ptent, union mc_target *target) { struct page *page = NULL; struct page_cgroup *pc; - int ret = 0; + enum mc_target_type ret = MC_TARGET_NONE; swp_entry_t ent = { .val = 0 }; if (pte_present(ptent)) @@ -5230,7 +5230,7 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma, page = mc_handle_file_pte(vma, addr, ptent, &ent); if (!page && !ent.val) - return 0; + return ret; if (page) { pc = lookup_page_cgroup(page); /* @@ -5270,7 +5270,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) - if (is_target_pte_for_mc(vma, addr, *pte, NULL)) + if (get_mctgt_type(vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ pte_unmap_unlock(pte - 1, ptl); cond_resched(); @@ -5442,8 +5442,7 @@ retry: if (!mc.precharge) break; - type = is_target_pte_for_mc(vma, addr, ptent, &target); - switch (type) { + switch (get_mctgt_type(vma, addr, ptent, &target)) { case MC_TARGET_PAGE: page = target.page; if (isolate_lru_page(page)) @@ -5456,7 +5455,7 @@ retry: mc.moved_charge++; } putback_lru_page(page); -put: /* is_target_pte_for_mc() gets the page */ +put: /* get_mctgt_type() gets the page */ put_page(page); break; case MC_TARGET_SWAP: -- cgit v0.10.2 From d8c37c480678ebe09bc570f33e085e28049db035 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:34:27 -0700 Subject: thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE These macros will be used in a later patch, where all usages are expected to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to detect unexpected usages, we convert the existing BUG() to BUILD_BUG(). [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c] Signed-off-by: Naoya Horiguchi Acked-by: Hillf Danton Reviewed-by: Andrea Arcangeli Reviewed-by: KAMEZAWA Hiroyuki Acked-by: David Rientjes Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f56cacb..c8af7a2 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -51,6 +51,9 @@ extern pmd_t *page_check_address_pmd(struct page *page, unsigned long address, enum page_check_address_pmd_flag flag); +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) +#define HPAGE_PMD_NR (1< MAX_ORDER #error "hugepages can't be allocated by the buddy allocator" #endif @@ -158,9 +159,9 @@ static inline struct page *compound_trans_head(struct page *page) return page; } #else /* CONFIG_TRANSPARENT_HUGEPAGE */ -#define HPAGE_PMD_SHIFT ({ BUG(); 0; }) -#define HPAGE_PMD_MASK ({ BUG(); 0; }) -#define HPAGE_PMD_SIZE ({ BUG(); 0; }) +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) +#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) +#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) #define hpage_nr_pages(x) 1 diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index eb663fb..5a74fea 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -70,10 +70,11 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { int young; -#ifndef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + VM_BUG_ON(address & ~HPAGE_PMD_MASK); +#else BUG(); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ - VM_BUG_ON(address & ~HPAGE_PMD_MASK); young = pmdp_test_and_clear_young(vma, address, pmdp); if (young) flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); -- cgit v0.10.2 From 12724850e8064f64b6223d26d78c0597c742c65a Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 21 Mar 2012 16:34:28 -0700 Subject: memcg: avoid THP split in task migration Currently we can't do task migration among memory cgroups without THP split, which means processes heavily using THP experience large overhead in task migration. This patch introduces the code for moving charge of THP and makes THP more valuable. Signed-off-by: Naoya Horiguchi Acked-by: Hillf Danton Cc: Andrea Arcangeli Acked-by: KAMEZAWA Hiroyuki Cc: David Rientjes Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8d00a9..b2ee6df 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5256,6 +5256,41 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, return ret; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +/* + * We don't consider swapping or file mapped pages because THP does not + * support them for now. + * Caller should make sure that pmd_trans_huge(pmd) is true. + */ +static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, + unsigned long addr, pmd_t pmd, union mc_target *target) +{ + struct page *page = NULL; + struct page_cgroup *pc; + enum mc_target_type ret = MC_TARGET_NONE; + + page = pmd_page(pmd); + VM_BUG_ON(!page || !PageHead(page)); + if (!move_anon()) + return ret; + pc = lookup_page_cgroup(page); + if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) { + ret = MC_TARGET_PAGE; + if (target) { + get_page(page); + target->page = page; + } + } + return ret; +} +#else +static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, + unsigned long addr, pmd_t pmd, union mc_target *target) +{ + return MC_TARGET_NONE; +} +#endif + static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) @@ -5264,9 +5299,12 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, pte_t *pte; spinlock_t *ptl; - split_huge_page_pmd(walk->mm, pmd); - if (pmd_trans_unstable(pmd)) + if (pmd_trans_huge_lock(pmd, vma) == 1) { + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) + mc.precharge += HPAGE_PMD_NR; + spin_unlock(&vma->vm_mm->page_table_lock); return 0; + } pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; pte++, addr += PAGE_SIZE) @@ -5425,18 +5463,49 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, struct vm_area_struct *vma = walk->private; pte_t *pte; spinlock_t *ptl; + enum mc_target_type target_type; + union mc_target target; + struct page *page; + struct page_cgroup *pc; - split_huge_page_pmd(walk->mm, pmd); - if (pmd_trans_unstable(pmd)) + /* + * We don't take compound_lock() here but no race with splitting thp + * happens because: + * - if pmd_trans_huge_lock() returns 1, the relevant thp is not + * under splitting, which means there's no concurrent thp split, + * - if another thread runs into split_huge_page() just after we + * entered this if-block, the thread must wait for page table lock + * to be unlocked in __split_huge_page_splitting(), where the main + * part of thp split is not executed yet. + */ + if (pmd_trans_huge_lock(pmd, vma) == 1) { + if (!mc.precharge) { + spin_unlock(&vma->vm_mm->page_table_lock); + return 0; + } + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); + if (target_type == MC_TARGET_PAGE) { + page = target.page; + if (!isolate_lru_page(page)) { + pc = lookup_page_cgroup(page); + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, + pc, mc.from, mc.to, + false)) { + mc.precharge -= HPAGE_PMD_NR; + mc.moved_charge += HPAGE_PMD_NR; + } + putback_lru_page(page); + } + put_page(page); + } + spin_unlock(&vma->vm_mm->page_table_lock); return 0; + } + retry: pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); - union mc_target target; - int type; - struct page *page; - struct page_cgroup *pc; swp_entry_t ent; if (!mc.precharge) -- cgit v0.10.2