summaryrefslogtreecommitdiff
path: root/mm/memory.c
AgeCommit message (Collapse)Author
2015-02-13mm, rt: kmap_atomic schedulingPeter Zijlstra
In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ]
2015-02-13mm: shrink the page frame to !-rt sizePeter Zijlstra
He below is a boot-tested hack to shrink the page frame size back to normal. Should be a net win since there should be many less PTE-pages than page-frames. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13mm: Remove preempt count from pagefault disable/enableThomas Gleixner
Now that all users are cleaned up, we can remove the preemption count. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13mm: raw_pagefault_disablePeter Zijlstra
Adding migrate_disable() to pagefault_disable() to preserve the per-cpu thing for kmap_atomic might not have been the best of choices. But short of adding preempt_disable/migrate_disable foo all over the kmap code it still seems the best way. It does however yield the below borkage as well as wreck !-rt builds since !-rt does rely on pagefault_disable() not preempting. So fix all that up by adding raw_pagefault_disable(). <NMI> [<ffffffff81076d5c>] warn_slowpath_common+0x85/0x9d [<ffffffff81076e17>] warn_slowpath_fmt+0x46/0x48 [<ffffffff814f7fca>] ? _raw_spin_lock+0x6c/0x73 [<ffffffff810cac87>] ? watchdog_overflow_callback+0x9b/0xd0 [<ffffffff810caca3>] watchdog_overflow_callback+0xb7/0xd0 [<ffffffff810f51bb>] __perf_event_overflow+0x11c/0x1fe [<ffffffff810f298f>] ? perf_event_update_userpage+0x149/0x151 [<ffffffff810f2846>] ? perf_event_task_disable+0x7c/0x7c [<ffffffff810f5b7c>] perf_event_overflow+0x14/0x16 [<ffffffff81046e02>] x86_pmu_handle_irq+0xcb/0x108 [<ffffffff814f9a6b>] perf_event_nmi_handler+0x46/0x91 [<ffffffff814fb2ba>] notifier_call_chain+0x79/0xa6 [<ffffffff814fb34d>] __atomic_notifier_call_chain+0x66/0x98 [<ffffffff814fb2e7>] ? notifier_call_chain+0xa6/0xa6 [<ffffffff814fb393>] atomic_notifier_call_chain+0x14/0x16 [<ffffffff814fb3c3>] notify_die+0x2e/0x30 [<ffffffff814f8f75>] do_nmi+0x7e/0x22b [<ffffffff814f8bca>] nmi+0x1a/0x2c [<ffffffff814fb130>] ? sub_preempt_count+0x4b/0xaa <<EOE>> <IRQ> [<ffffffff812d44cc>] delay_tsc+0xac/0xd1 [<ffffffff812d4399>] __delay+0xf/0x11 [<ffffffff812d95d9>] do_raw_spin_lock+0xd2/0x13c [<ffffffff814f813e>] _raw_spin_lock_irqsave+0x6b/0x85 [<ffffffff8106772a>] ? task_rq_lock+0x35/0x8d [<ffffffff8106772a>] task_rq_lock+0x35/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8103ad79>] show_trace_log_lvl+0x54/0x5d [<ffffffff8103ad97>] show_trace+0x15/0x17 [<ffffffff814f4f5f>] dump_stack+0x77/0x80 [<ffffffff812d94b0>] spin_bug+0x9c/0xa3 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff812d954e>] do_raw_spin_lock+0x47/0x13c [<ffffffff814f7fbe>] _raw_spin_lock+0x60/0x73 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff81067745>] task_rq_lock+0x50/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8104369b>] save_stack_trace+0x2f/0x4c [<ffffffff810a7848>] save_trace+0x3f/0xaf [<ffffffff810aa2bd>] mark_lock+0x228/0x530 [<ffffffff810aac27>] __lock_acquire+0x662/0x1812 [<ffffffff8103dad4>] ? native_sched_clock+0x37/0x6d [<ffffffff810a790e>] ? trace_hardirqs_off_caller+0x1f/0x99 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810ac403>] lock_acquire+0x145/0x18a [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff814f7f9e>] _raw_spin_lock+0x40/0x73 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810693f6>] sched_rt_period_timer+0xbd/0x218 [<ffffffff8109aa39>] __run_hrtimer+0x1e4/0x347 [<ffffffff81069339>] ? can_migrate_task.clone.82+0x14a/0x14a [<ffffffff8109b97c>] hrtimer_interrupt+0xee/0x1d6 [<ffffffff814fb23d>] ? add_preempt_count+0xae/0xb2 [<ffffffff814ffb38>] smp_apic_timer_interrupt+0x85/0x98 [<ffffffff814fef13>] apic_timer_interrupt+0x13/0x20 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-31keae8mkjiv8esq4rl76cib@git.kernel.org
2015-02-13mm: Prepare decoupling the page fault disabling logicIngo Molnar
Add a pagefault_disabled variable to task_struct to allow decoupling the pagefault-disabled logic from the preempt count. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-02-13Reset to 3.12.37Scott Wood
2014-05-14mm, rt: kmap_atomic schedulingPeter Zijlstra
In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ]
2014-05-14mm: shrink the page frame to !-rt sizePeter Zijlstra
He below is a boot-tested hack to shrink the page frame size back to normal. Should be a net win since there should be many less PTE-pages than page-frames. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14mm: Remove preempt count from pagefault disable/enableThomas Gleixner
Now that all users are cleaned up, we can remove the preemption count. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14mm: raw_pagefault_disablePeter Zijlstra
Adding migrate_disable() to pagefault_disable() to preserve the per-cpu thing for kmap_atomic might not have been the best of choices. But short of adding preempt_disable/migrate_disable foo all over the kmap code it still seems the best way. It does however yield the below borkage as well as wreck !-rt builds since !-rt does rely on pagefault_disable() not preempting. So fix all that up by adding raw_pagefault_disable(). <NMI> [<ffffffff81076d5c>] warn_slowpath_common+0x85/0x9d [<ffffffff81076e17>] warn_slowpath_fmt+0x46/0x48 [<ffffffff814f7fca>] ? _raw_spin_lock+0x6c/0x73 [<ffffffff810cac87>] ? watchdog_overflow_callback+0x9b/0xd0 [<ffffffff810caca3>] watchdog_overflow_callback+0xb7/0xd0 [<ffffffff810f51bb>] __perf_event_overflow+0x11c/0x1fe [<ffffffff810f298f>] ? perf_event_update_userpage+0x149/0x151 [<ffffffff810f2846>] ? perf_event_task_disable+0x7c/0x7c [<ffffffff810f5b7c>] perf_event_overflow+0x14/0x16 [<ffffffff81046e02>] x86_pmu_handle_irq+0xcb/0x108 [<ffffffff814f9a6b>] perf_event_nmi_handler+0x46/0x91 [<ffffffff814fb2ba>] notifier_call_chain+0x79/0xa6 [<ffffffff814fb34d>] __atomic_notifier_call_chain+0x66/0x98 [<ffffffff814fb2e7>] ? notifier_call_chain+0xa6/0xa6 [<ffffffff814fb393>] atomic_notifier_call_chain+0x14/0x16 [<ffffffff814fb3c3>] notify_die+0x2e/0x30 [<ffffffff814f8f75>] do_nmi+0x7e/0x22b [<ffffffff814f8bca>] nmi+0x1a/0x2c [<ffffffff814fb130>] ? sub_preempt_count+0x4b/0xaa <<EOE>> <IRQ> [<ffffffff812d44cc>] delay_tsc+0xac/0xd1 [<ffffffff812d4399>] __delay+0xf/0x11 [<ffffffff812d95d9>] do_raw_spin_lock+0xd2/0x13c [<ffffffff814f813e>] _raw_spin_lock_irqsave+0x6b/0x85 [<ffffffff8106772a>] ? task_rq_lock+0x35/0x8d [<ffffffff8106772a>] task_rq_lock+0x35/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8103ad79>] show_trace_log_lvl+0x54/0x5d [<ffffffff8103ad97>] show_trace+0x15/0x17 [<ffffffff814f4f5f>] dump_stack+0x77/0x80 [<ffffffff812d94b0>] spin_bug+0x9c/0xa3 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff812d954e>] do_raw_spin_lock+0x47/0x13c [<ffffffff814f7fbe>] _raw_spin_lock+0x60/0x73 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff81067745>] task_rq_lock+0x50/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8104369b>] save_stack_trace+0x2f/0x4c [<ffffffff810a7848>] save_trace+0x3f/0xaf [<ffffffff810aa2bd>] mark_lock+0x228/0x530 [<ffffffff810aac27>] __lock_acquire+0x662/0x1812 [<ffffffff8103dad4>] ? native_sched_clock+0x37/0x6d [<ffffffff810a790e>] ? trace_hardirqs_off_caller+0x1f/0x99 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810ac403>] lock_acquire+0x145/0x18a [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff814f7f9e>] _raw_spin_lock+0x40/0x73 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810693f6>] sched_rt_period_timer+0xbd/0x218 [<ffffffff8109aa39>] __run_hrtimer+0x1e4/0x347 [<ffffffff81069339>] ? can_migrate_task.clone.82+0x14a/0x14a [<ffffffff8109b97c>] hrtimer_interrupt+0xee/0x1d6 [<ffffffff814fb23d>] ? add_preempt_count+0xae/0xb2 [<ffffffff814ffb38>] smp_apic_timer_interrupt+0x85/0x98 [<ffffffff814fef13>] apic_timer_interrupt+0x13/0x20 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-31keae8mkjiv8esq4rl76cib@git.kernel.org
2014-05-14mm: Prepare decoupling the page fault disabling logicIngo Molnar
Add a pagefault_disabled variable to task_struct to allow decoupling the pagefault-disabled logic from the preempt count. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-14Reset to 3.12.19Scott Wood
2014-04-10mm, rt: kmap_atomic schedulingPeter Zijlstra
In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ]
2014-04-10mm: shrink the page frame to !-rt sizePeter Zijlstra
He below is a boot-tested hack to shrink the page frame size back to normal. Should be a net win since there should be many less PTE-pages than page-frames. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10mm: Remove preempt count from pagefault disable/enableThomas Gleixner
Now that all users are cleaned up, we can remove the preemption count. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10mm: raw_pagefault_disablePeter Zijlstra
Adding migrate_disable() to pagefault_disable() to preserve the per-cpu thing for kmap_atomic might not have been the best of choices. But short of adding preempt_disable/migrate_disable foo all over the kmap code it still seems the best way. It does however yield the below borkage as well as wreck !-rt builds since !-rt does rely on pagefault_disable() not preempting. So fix all that up by adding raw_pagefault_disable(). <NMI> [<ffffffff81076d5c>] warn_slowpath_common+0x85/0x9d [<ffffffff81076e17>] warn_slowpath_fmt+0x46/0x48 [<ffffffff814f7fca>] ? _raw_spin_lock+0x6c/0x73 [<ffffffff810cac87>] ? watchdog_overflow_callback+0x9b/0xd0 [<ffffffff810caca3>] watchdog_overflow_callback+0xb7/0xd0 [<ffffffff810f51bb>] __perf_event_overflow+0x11c/0x1fe [<ffffffff810f298f>] ? perf_event_update_userpage+0x149/0x151 [<ffffffff810f2846>] ? perf_event_task_disable+0x7c/0x7c [<ffffffff810f5b7c>] perf_event_overflow+0x14/0x16 [<ffffffff81046e02>] x86_pmu_handle_irq+0xcb/0x108 [<ffffffff814f9a6b>] perf_event_nmi_handler+0x46/0x91 [<ffffffff814fb2ba>] notifier_call_chain+0x79/0xa6 [<ffffffff814fb34d>] __atomic_notifier_call_chain+0x66/0x98 [<ffffffff814fb2e7>] ? notifier_call_chain+0xa6/0xa6 [<ffffffff814fb393>] atomic_notifier_call_chain+0x14/0x16 [<ffffffff814fb3c3>] notify_die+0x2e/0x30 [<ffffffff814f8f75>] do_nmi+0x7e/0x22b [<ffffffff814f8bca>] nmi+0x1a/0x2c [<ffffffff814fb130>] ? sub_preempt_count+0x4b/0xaa <<EOE>> <IRQ> [<ffffffff812d44cc>] delay_tsc+0xac/0xd1 [<ffffffff812d4399>] __delay+0xf/0x11 [<ffffffff812d95d9>] do_raw_spin_lock+0xd2/0x13c [<ffffffff814f813e>] _raw_spin_lock_irqsave+0x6b/0x85 [<ffffffff8106772a>] ? task_rq_lock+0x35/0x8d [<ffffffff8106772a>] task_rq_lock+0x35/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8103ad79>] show_trace_log_lvl+0x54/0x5d [<ffffffff8103ad97>] show_trace+0x15/0x17 [<ffffffff814f4f5f>] dump_stack+0x77/0x80 [<ffffffff812d94b0>] spin_bug+0x9c/0xa3 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff812d954e>] do_raw_spin_lock+0x47/0x13c [<ffffffff814f7fbe>] _raw_spin_lock+0x60/0x73 [<ffffffff81067745>] ? task_rq_lock+0x50/0x8d [<ffffffff81067745>] task_rq_lock+0x50/0x8d [<ffffffff8106fe2f>] migrate_disable+0x65/0x12c [<ffffffff81114e69>] pagefault_disable+0xe/0x1f [<ffffffff81039c73>] dump_trace+0x21f/0x2e2 [<ffffffff8104369b>] save_stack_trace+0x2f/0x4c [<ffffffff810a7848>] save_trace+0x3f/0xaf [<ffffffff810aa2bd>] mark_lock+0x228/0x530 [<ffffffff810aac27>] __lock_acquire+0x662/0x1812 [<ffffffff8103dad4>] ? native_sched_clock+0x37/0x6d [<ffffffff810a790e>] ? trace_hardirqs_off_caller+0x1f/0x99 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810ac403>] lock_acquire+0x145/0x18a [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff814f7f9e>] _raw_spin_lock+0x40/0x73 [<ffffffff810693f6>] ? sched_rt_period_timer+0xbd/0x218 [<ffffffff810693f6>] sched_rt_period_timer+0xbd/0x218 [<ffffffff8109aa39>] __run_hrtimer+0x1e4/0x347 [<ffffffff81069339>] ? can_migrate_task.clone.82+0x14a/0x14a [<ffffffff8109b97c>] hrtimer_interrupt+0xee/0x1d6 [<ffffffff814fb23d>] ? add_preempt_count+0xae/0xb2 [<ffffffff814ffb38>] smp_apic_timer_interrupt+0x85/0x98 [<ffffffff814fef13>] apic_timer_interrupt+0x13/0x20 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-31keae8mkjiv8esq4rl76cib@git.kernel.org
2014-04-10mm: Prepare decoupling the page fault disabling logicIngo Molnar
Add a pagefault_disabled variable to task_struct to allow decoupling the pagefault-disabled logic from the preempt count. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-05mm, thp: fix infinite loop on memcg OOMKirill A. Shutemov
commit 9845cbbd113fbb5b769a45d8e88dc47bc12df4e0 upstream. Masayoshi Mizuma reported a bug with the hang of an application under the memcg limit. It happens on write-protection fault to huge zero page If we successfully allocate a huge page to replace zero page but hit the memcg limit we need to split the zero page with split_huge_page_pmd() and fallback to small pages. The other part of the problem is that VM_FAULT_OOM has special meaning in do_huge_pmd_wp_page() context. __handle_mm_fault() expects the page to be split if it sees VM_FAULT_OOM and it will will retry page fault handling. This causes an infinite loop if the page was not split. do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed to allocate one small page, so fallback to small pages will not help. The solution for this part is to replace VM_FAULT_OOM with VM_FAULT_FALLBACK is fallback required. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2013-10-29mm: numa: Sanitize task_numa_fault() callsitesMel Gorman
There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-17mm: memcg: handle non-error OOM situations more gracefullyJohannes Weiner
Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") assumed that only a few places that can trigger a memcg OOM situation do not return VM_FAULT_OOM, like optional page cache readahead. But there are many more and it's impractical to annotate them all. First of all, we don't want to invoke the OOM killer when the failed allocation is gracefully handled, so defer the actual kill to the end of the fault handling as well. This simplifies the code quite a bit for added bonus. Second, since a failed allocation might not be the abrupt end of the fault, the memcg OOM handler needs to be re-entrant until the fault finishes for subsequent allocation attempts. If an allocation is attempted after the task already OOMed, allow it to bypass the limit so that it can quickly finish the fault and invoke the OOM killer. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-17mm: migration: do not lose soft dirty bit if page is in migration stateCyrill Gorcunov
If page migration is turned on in config and the page is migrating, we may lose the soft dirty bit. If fork and mprotect are called on migrating pages (once migration is complete) pages do not obtain the soft dirty bit in the correspond pte entries. Fix it adding an appropriate test on swap entries. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()Kirill A. Shutemov
do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault() to handle fallback path. Let's consolidate code back by introducing VM_FAULT_FALLBACK return code. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm: memcg: do not trap chargers with full callstack on OOMJohannes Weiner
The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm: memcg: enable memcg OOM killer only for user faultsJohannes Weiner
System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: migrate: add hugepage migration code to move_pages()Naoya Horiguchi
Extend move_pages() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with move_pages(2) after applying the enablement patch which comes later in this series. We avoid getting refcount on tail pages of hugepage, because unlike thp, hugepage is not split and we need not care about races with splitting. And migration of larger (1GB for x86_64) hugepage are not enabled. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: move pgtable related functions to right placeJoonsoo Kim
pgtable related functions are mostly in pgtable-generic.c. So move remaining functions from memory.c to pgtable-generic.c. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19Merge 3.11-rc6 into char-misc-nextGreg Kroah-Hartman
We want these fixes in this tree. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-16Fix TLB gather virtual address range invalidation corner casesLinus Torvalds
Ben Tebulin reported: "Since v3.7.2 on two independent machines a very specific Git repository fails in 9/10 cases on git-fsck due to an SHA1/memory failures. This only occurs on a very specific repository and can be reproduced stably on two independent laptops. Git mailing list ran out of ideas and for me this looks like some very exotic kernel issue" and bisected the failure to the backport of commit 53a59fc67f97 ("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT"). That commit itself is not actually buggy, but what it does is to make it much more likely to hit the partial TLB invalidation case, since it introduces a new case in tlb_next_batch() that previously only ever happened when running out of memory. The real bug is that the TLB gather virtual memory range setup is subtly buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather: enable tlb flush range in generic mmu_gather"), and the range handling was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB range flushed when __tlb_remove_page() runs out of slots"), but that fix was not complete. The problem with the TLB gather virtual address range is that it isn't set up by the initial tlb_gather_mmu() initialization (which didn't get the TLB range information), but it is set up ad-hoc later by the functions that actually flush the TLB. And so any such case that forgot to update the TLB range entries would potentially miss TLB invalidates. Rather than try to figure out exactly which particular ad-hoc range setup was missing (I personally suspect it's the hugetlb case in zap_huge_pmd(), which didn't have the same logic as zap_pte_range() did), this patch just gets rid of the problem at the source: make the TLB range information available to tlb_gather_mmu(), and initialize it when initializing all the other tlb gather fields. This makes the patch larger, but conceptually much simpler. And the end result is much more understandable; even if you want to play games with partial ranges when invalidating the TLB contents in chunks, now the range information is always there, and anybody who doesn't want to bother with it won't introduce subtle bugs. Ben verified that this fixes his problem. Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com> Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au> Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-14mm: save soft-dirty bits on file pagesCyrill Gorcunov
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-14mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-12mm: make generic_access_phys available for modulesUwe Kleine-König
In the next commit this function will be used in the uio subsystem Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-09mm: remove unused VM_<READfoo> macros and expand other in-placeJoe Perches
These VM_<READfoo> macros aren't used very often and three of them aren't used at all. Expand the ones that are used in-place, and remove all the now unused #define VM_<foo> macros. VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added just before 2.4 and appears have never been used. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: kill global variable num_physpagesJiang Liu
Now all references to num_physpages have been removed, so kill it. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: fix the TLB range flushed when __tlb_remove_page() runs out of slotsVineet Gupta
zap_pte_range loops from @addr to @end. In the middle, if it runs out of batching slots, TLB entries needs to be flushed for @start to @interim, NOT @interim to @end. Since ARC port doesn't use page free batching I can't test it myself but this seems like the right thing to do. Observed this when working on a fix for the issue at thread: http://www.spinics.net/lists/linux-arch/msg21736.html Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFTLibin
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented as a inline funcion vma_pages() in linux/mm.h, so using it. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-02Merge branch 'sched-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull voluntary preemption fixes from Ingo Molnar: "This tree contains a speedup which is achieved through better might_sleep()/might_fault() preemption point annotations for uaccess functions, by Michael S Tsirkin: 1. The only reason uaccess routines might sleep is if they fault. Make this explicit for all architectures. 2. A voluntary preemption point in uaccess functions means compiler can't inline them efficiently, this breaks assumptions that they are very fast and small that e.g. net code seems to make. Remove this preemption point so behaviour matches with what callers assume. 3. Accesses (e.g through socket ops) to kernel memory with KERNEL_DS like net/sunrpc does will never sleep. Remove an unconditinal might_sleep() in the might_fault() inline in kernel.h (used when PROVE_LOCKING is not set). 4. Accesses with pagefault_disable() return EFAULT but won't cause caller to sleep. Check for that and thus avoid might_sleep() when PROVE_LOCKING is set. These changes offer a nice speedup for CONFIG_PREEMPT_VOLUNTARY=y kernels, here's a network bandwidth measurement between a virtual machine and the host: before: incoming: 7122.77 Mb/s outgoing: 8480.37 Mb/s after: incoming: 8619.24 Mb/s [ +21.0% ] outgoing: 9455.42 Mb/s [ +11.5% ] I kept these changes in a separate tree, separate from scheduler changes, because it's a mixed MM and scheduler topic" * 'sched-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm, sched: Allow uaccess in atomic with pagefault_disable() mm, sched: Drop voluntary schedule from might_fault() x86: uaccess s/might_sleep/might_fault/ tile: uaccess s/might_sleep/might_fault/ powerpc: uaccess s/might_sleep/might_fault/ mn10300: uaccess s/might_sleep/might_fault/ microblaze: uaccess s/might_sleep/might_fault/ m32r: uaccess s/might_sleep/might_fault/ frv: uaccess s/might_sleep/might_fault/ arm64: uaccess s/might_sleep/might_fault/ asm-generic: uaccess s/might_sleep/might_fault/
2013-06-06arch, mm: Remove tlb_fast_mode()Peter Zijlstra
Since the introduction of preemptible mmu_gather TLB fast mode has been broken. TLB fast mode relies on there being absolutely no concurrency; it frees pages first and invalidates TLBs later. However now we can get concurrency and stuff goes *bang*. This patch removes all tlb_fast_mode() code; it was found the better option vs trying to patch the hole by entangling tlb invalidation with the scheduler. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Reported-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-28mm, sched: Allow uaccess in atomic with pagefault_disable()Michael S. Tsirkin
This changes might_fault() so that it does not trigger a false positive diagnostic for e.g. the following sequence: spin_lock_irqsave() pagefault_disable() copy_to_user() pagefault_enable() spin_unlock_irqrestore() In particular vhost wants to do this, to call socket ops from under a lock. There are 3 cases to consider: - CONFIG_PROVE_LOCKING - might_fault is non-inline so it's easy to move the in_atomic test to fix up the false positive warning. - CONFIG_DEBUG_ATOMIC_SLEEP - might_fault is currently inline, but we are calling a non-inline __might_sleep anyway, so let's use the non-line version of might_fault that does the right thing. - !CONFIG_DEBUG_ATOMIC_SLEEP && !CONFIG_PROVE_LOCKING __might_sleep is a nop so might_fault is a nop. Make this explicit. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1369577426-26721-11-git-send-email-mst@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28mm, sched: Drop voluntary schedule from might_fault()Michael S. Tsirkin
might_fault() is called from functions like copy_to_user() which most callers expect to be very fast, like a couple of instructions. So functions like memcpy_toiovec() call them many times in a loop. But might_fault() calls might_sleep() and with CONFIG_PREEMPT_VOLUNTARY this results in a function call. Let's not do this - just call __might_sleep() that produces a diagnostic for sleep within atomic, but drop might_preempt(). Here's a test sending traffic between the VM and the host, host is built with CONFIG_PREEMPT_VOLUNTARY: before: incoming: 7122.77 Mb/s outgoing: 8480.37 Mb/s after: incoming: 8619.24 Mb/s outgoing: 9455.42 Mb/s As a side effect, this fixes an issue pointed out by Ingo: might_fault might schedule differently depending on PROVE_LOCKING. Now there's no preemption point in both cases, so it's consistent. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1369577426-26721-10-git-send-email-mst@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits) mm: Convert print_symbol to %pSR gfs2: Convert print_symbol to %pSR m32r: Convert print_symbol to %pSR iostats.txt: add easy-to-find description for field 6 x86 cmpxchg.h: fix wrong comment treewide: Fix typo in printk and comments doc: devicetree: Fix various typos docbook: fix 8250 naming in device-drivers pata_pdc2027x: Fix compiler warning treewide: Fix typo in printks mei: Fix comments in drivers/misc/mei treewide: Fix typos in kernel messages pm44xx: Fix comment for "CONFIG_CPU_IDLE" doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP" mmzone: correct "pags" to "pages" in comment. kernel-parameters: remove outdated 'noresidual' parameter Remove spurious _H suffixes from ifdef comments sound: Remove stray pluses from Kconfig file radio-shark: Fix printk "CONFIG_LED_CLASS" doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE ...
2013-04-29THP: fix comment about memory barrierMinchan Kim
Currently the memory barrier in __do_huge_pmd_anonymous_page doesn't work. Because lru_cache_add_lru uses pagevec so it could miss spinlock easily so above rule was broken so user might see inconsistent data. I was not first person who pointed out the problem. Mel and Peter pointed out a few months ago and Peter pointed out further that even spin_lock/unlock can't make sure of it: http://marc.info/?t=134333512700004 In particular: *A = a; LOCK UNLOCK *B = b; may occur as: LOCK, STORE *B, STORE *A, UNLOCK At last, Hugh pointed out that even we don't need memory barrier in there because __SetPageUpdate already have done it from Nick's commit 0ed361dec369 ("mm: fix PageUptodate data race") explicitly. So this patch fixes comment on THP and adds same comment for do_anonymous_page, too because everybody except Hugh was missing that. It means we need a comment about that. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm: Convert print_symbol to %pSRJoe Perches
Use the new vsprintf extension to avoid any possible message interleaving. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-04-16vm: add vm_iomap_memory() helper functionLinus Torvalds
Various drivers end up replicating the code to mmap() their memory buffers into user space, and our core memory remapping function may be very flexible but it is unnecessarily complicated for the common cases to use. Our internal VM uses pfn's ("page frame numbers") which simplifies things for the VM, and allows us to pass physical addresses around in a denser and more efficient format than passing a "phys_addr_t" around, and having to shift it up and down by the page size. But it just means that drivers end up doing that shifting instead at the interface level. It also means that drivers end up mucking around with internal VM things like the vma details (vm_pgoff, vm_start/end) way more than they really need to. So this just exports a function to map a certain physical memory range into user space (using a phys_addr_t based interface that is much more natural for a driver) and hides all the complexity from the driver. Some drivers will still end up tweaking the vm_page_prot details for things like prefetching or cacheability etc, but that's actually relevant to the driver, rather than caring about what the page offset of the mapping is into the particular IO memory region. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-12x86-32: Fix possible incomplete TLB invalidate with PAE pagetablesDave Hansen
This patch attempts to fix: https://bugzilla.kernel.org/show_bug.cgi?id=56461 The symptom is a crash and messages like this: chrome: Corrupted page table at address 34a03000 *pdpt = 0000000000000000 *pde = 0000000000000000 Bad pagetable: 000f [#1] PREEMPT SMP Ingo guesses this got introduced by commit 611ae8e3f520 ("x86/tlb: enable tlb flush range support for x86") since that code started to free unused pagetables. On x86-32 PAE kernels, that new code has the potential to free an entire PMD page and will clear one of the four page-directory-pointer-table (aka pgd_t entries). The hardware aggressively "caches" these top-level entries and invlpg does not actually affect the CPU's copy. If we clear one we *HAVE* to do a full TLB flush, otherwise we might continue using a freed pmd page. (note, we do this properly on the population side in pud_populate()). This patch tracks whenever we clear one of these entries in the 'struct mmu_gather', and ensures that we follow up with a full tlb flush. BTW, I disassembled and checked that: if (tlb->fullmm == 0) and if (!tlb->fullmm && !tlb->need_flush_all) generate essentially the same code, so there should be zero impact there to the !PAE case. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Artem S Tashkinov <t.artem@mailcity.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-25Merge tag 'modules-next-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module update from Rusty Russell: "The sweeping change is to make add_taint() explicitly indicate whether to disable lockdep, but it's a mechanical change." * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: MODSIGN: Add option to not sign modules during modules_install MODSIGN: Add -s <signature> option to sign-file MODSIGN: Specify the hash algorithm on sign-file command line MODSIGN: Simplify Makefile with a Kconfig helper module: clean up load_module a little more. modpost: Ignore ARC specific non-alloc sections module: constify within_module_* taint: add explicit flag to show whether lock dep is still OK. module: printk message when module signature fail taints kernel.
2013-02-24mm: cleanup "swapcache" in do_swap_pageHugh Dickins
I dislike the way in which "swapcache" gets used in do_swap_page(): there is always a page from swapcache there (even if maybe uncached by the time we lock it), but tests are made according to "swapcache". Rework that with "page != swapcache", as has been done in unuse_pte(). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm,ksm: FOLL_MIGRATION do migration_entry_waitHugh Dickins
In "ksm: remove old stable nodes more thoroughly" I said that I'd never seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing, but it soon appeared once I tried fuller tests on the whole series. It turned out to be due to the KSM page migration itself: unmerge_and_ remove_all_rmap_items() failed to locate and replace all the KSM pages, because of that hiatus in page migration when old pte has been replaced by migration entry, but not yet by new pte. follow_page() finds no page at that instant, but a KSM page reappears shortly after, without a fault. Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait() for KSM's break_cow(). I'd have preferred to avoid another flag, and do it every time, in case someone else makes the same easy mistake; but did not find another transgressor (the common get_user_pages() is of course safe), and cannot be sure that every follow_page() caller is prepared to sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can already sleep there, since anon_vma locking was changed to mutex, but maybe that's somehow excluded. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm: accelerate mm_populate() treatment of THP pagesMichel Lespinasse
This change adds a follow_page_mask function which is equivalent to follow_page, but with an extra page_mask argument. follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page, and to 0 in other cases. __get_user_pages() makes use of this in order to accelerate populating THP ranges - that is, when both the pages and vmas arrays are NULL, we don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also avoid taking mm->page_table_lock that many times). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24mm: use long type for page counts in mm_populate() and get_user_pages()Michel Lespinasse
Use long type for page counts in mm_populate() so as to avoid integer overflow when running the following test code: int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); printf("done\n"); return 0; } Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-24ksm: remove old stable nodes more thoroughlyHugh Dickins
Switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree. It's not something that people will often want to do, but it would be lame to demand a reboot when they're trying to determine which merge_across_nodes setting is best. How can this happen? We only permit switching merge_across_nodes when pages_shared is 0, and usually set run 2 to force that beforehand, which ought to unmerge everything: yet oopses still occur when you then run 1. Three causes: 1. The old stable tree (built according to the inverse merge_across_nodes) has not been fully torn down. A stable node lingers until get_ksm_page() notices that the page it references no longer references it: but the page is not necessarily freed as soon as expected, particularly when swapcache. Fix this with a pass through the old stable tree, applying get_ksm_page() to each of the remaining nodes (most found stale and removed immediately), with forced removal of any left over. Unless the page is still mapped: I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE and EBUSY than BUG. 2. __ksm_enter() has a nice little optimization, to insert the new mm just behind ksmd's cursor, so there's a full pass for it to stabilize (or be removed) before ksmd addresses it. Nice when ksmd is running, but not so nice when we're trying to unmerge all mms: we were missing those mms forked and inserted behind the unmerge cursor. Easily fixed by inserting at the end when KSM_RUN_UNMERGE. 3. It is possible for a KSM page to be faulted back from swapcache into an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. A long outstanding, unrelated bugfix sneaks in with that third fix: ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O error when read in from swap) to a page which it then marks Uptodate. Fix this case by not copying, letting do_swap_page() discover the error. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>