From 85bd839983778fcd0c1c043327b14a046e979b39 Mon Sep 17 00:00:00 2001 From: Gu Zheng Date: Wed, 10 Jun 2015 11:14:43 -0700 Subject: mm/memory_hotplug.c: set zone->wait_table to null after freeing it Izumi found the following oops when hot re-adding a node: BUG: unable to handle kernel paging request at ffffc90008963690 IP: __wake_up_bit+0x20/0x70 Oops: 0000 [#1] SMP CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80 Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015 task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000 RIP: 0010:[] [] __wake_up_bit+0x20/0x70 RSP: 0018:ffff880017b97be8 EFLAGS: 00010246 RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9 RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648 RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800 R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000 FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0 Call Trace: unlock_page+0x6d/0x70 generic_write_end+0x53/0xb0 xfs_vm_write_end+0x29/0x80 [xfs] generic_perform_write+0x10a/0x1e0 xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs] xfs_file_write_iter+0x79/0x120 [xfs] __vfs_write+0xd4/0x110 vfs_write+0xac/0x1c0 SyS_write+0x58/0xd0 system_call_fastpath+0x12/0x76 Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48 RIP [] __wake_up_bit+0x20/0x70 RSP CR2: ffffc90008963690 Reproduce method (re-add a node):: Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic) This seems an use-after-free problem, and the root cause is zone->wait_table was not set to *NULL* after free it in try_offline_node. When hot re-add a node, we will reuse the pgdat of it, so does the zone struct, and when add pages to the target zone, it will init the zone first (including the wait_table) if the zone is not initialized. The judgement of zone initialized is based on zone->wait_table: static inline bool zone_is_initialized(struct zone *zone) { return !!zone->wait_table; } so if we do not set the zone->wait_table to *NULL* after free it, the memory hotplug routine will skip the init of new zone when hot re-add the node, and the wait_table still points to the freed memory, then we will access the invalid address when trying to wake up the waiting people after the i/o operation with the page is done, such as mentioned above. Signed-off-by: Gu Zheng Reported-by: Taku Izumi Reviewed by: Yasuaki Ishimatsu Cc: KAMEZAWA Hiroyuki Cc: Tang Chen Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 457bde5..9e88f74 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1969,8 +1969,10 @@ void try_offline_node(int nid) * wait_table may be allocated from boot memory, * here only free if it's allocated by vmalloc. */ - if (is_vmalloc_addr(zone->wait_table)) + if (is_vmalloc_addr(zone->wait_table)) { vfree(zone->wait_table); + zone->wait_table = NULL; + } } } EXPORT_SYMBOL(try_offline_node); -- cgit v0.10.2 From 7d638093d4b0e9ef15bd78f38f11f126e773cc14 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Wed, 10 Jun 2015 11:14:46 -0700 Subject: memcg: do not call reclaim if !__GFP_WAIT When trimming memcg consumption excess (see memory.high), we call try_to_free_mem_cgroup_pages without checking if we are allowed to sleep in the current context, which can result in a deadlock. Fix this. Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory") Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 14c2f20..9da23a7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2323,6 +2323,8 @@ done_restock: css_get_many(&memcg->css, batch); if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); + if (!(gfp_mask & __GFP_WAIT)) + goto done; /* * If the hierarchy is above the normal consumption range, * make the charging task trim their excess contribution. -- cgit v0.10.2 From d7ad41a1c498729b7584c257710b1b437a0c1470 Mon Sep 17 00:00:00 2001 From: Weijie Yang Date: Wed, 10 Jun 2015 11:14:49 -0700 Subject: zram: clear disk io accounting when reset zram device Clear zram disk io accounting when resetting the zram device. Otherwise the residual io accounting stat will affect the diskstat in the next zram active cycle. Signed-off-by: Weijie Yang Acked-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 8dcbced..6e134f4 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -805,7 +805,9 @@ static void zram_reset_device(struct zram *zram) memset(&zram->stats, 0, sizeof(zram->stats)); zram->disksize = 0; zram->max_comp_streams = 1; + set_capacity(zram->disk, 0); + part_stat_set_all(&zram->disk->part0, 0); up_write(&zram->init_lock); /* I/O operation under all of CPU are done so let's free */ -- cgit v0.10.2 From 5129e87cb81996d4d7e8f5a86b5355c5f2d0383b Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Wed, 10 Jun 2015 11:14:52 -0700 Subject: checkpatch: fix "GLOBAL_INITIALISERS" test Commit d5e616fc1c1d ("checkpatch: add a few more --fix corrections") broke the GLOBAL_INITIALISERS test with bad parentheses and optional leading spaces. Fix it. Signed-off-by: Joe Perches Reported-by: Bandan Das Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 89b1df4..c5ec977 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -3169,12 +3169,12 @@ sub process { } # check for global initialisers. - if ($line =~ /^\+(\s*$Type\s*$Ident\s*(?:\s+$Modifier))*\s*=\s*(0|NULL|false)\s*;/) { + if ($line =~ /^\+$Type\s*$Ident(?:\s+$Modifier)*\s*=\s*(?:0|NULL|false)\s*;/) { if (ERROR("GLOBAL_INITIALISERS", "do not initialise globals to 0 or NULL\n" . $herecurr) && $fix) { - $fixed[$fixlinenr] =~ s/($Type\s*$Ident\s*(?:\s+$Modifier))*\s*=\s*(0|NULL|false)\s*;/$1;/; + $fixed[$fixlinenr] =~ s/(^.$Type\s*$Ident(?:\s+$Modifier)*)\s*=\s*(0|NULL|false)\s*;/$1;/; } } # check for static initialisers. -- cgit v0.10.2 From f371763a79d5212c2cb216b46fa8af46ba56cee3 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 10 Jun 2015 11:14:54 -0700 Subject: mm: memcontrol: fix false-positive VM_BUG_ON() on -rt On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg swapout path because the spin_lock_irq(&mapping->tree_lock) in the caller doesn't actually disable the hardware interrupts - which is fine, because on -rt the tophalves run in process context and so we are still safe from preemption while updating the statistics. Remove the VM_BUG_ON() but keep the comment of what we rely on. Signed-off-by: Johannes Weiner Reported-by: Clark Williams Cc: Fernando Lopez-Lezcano Cc: Steven Rostedt Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9da23a7..a04225d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5835,9 +5835,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) if (!mem_cgroup_is_root(memcg)) page_counter_uncharge(&memcg->memory, 1); - /* XXX: caller holds IRQ-safe mapping->tree_lock */ - VM_BUG_ON(!irqs_disabled()); - + /* Caller disabled preemption with mapping->tree_lock */ mem_cgroup_charge_statistics(memcg, page, -1); memcg_check_events(memcg, page); } -- cgit v0.10.2 From 02f7b4145da113683ad64c74bf64605e16b71789 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 10 Jun 2015 11:14:57 -0700 Subject: zsmalloc: fix a null pointer dereference in destroy_handle_cache() If zs_create_pool()->create_handle_cache()->kmem_cache_create() or pool->name allocation fails, zs_create_pool()->destroy_handle_cache() will dereference the NULL pool->handle_cachep. Modify destroy_handle_cache() to avoid this. Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 08bd7a3..a8b5e74 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -289,7 +289,8 @@ static int create_handle_cache(struct zs_pool *pool) static void destroy_handle_cache(struct zs_pool *pool) { - kmem_cache_destroy(pool->handle_cachep); + if (pool->handle_cachep) + kmem_cache_destroy(pool->handle_cachep); } static unsigned long alloc_handle(struct zs_pool *pool) -- cgit v0.10.2 From 8e76d4eecf7afeec9328e21cd5880e281838d0d6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 10 Jun 2015 11:15:00 -0700 Subject: sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings Jovi Zhangwei reported the following problem Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages with GFP_COMP flag. [Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping: (null) index:0x0 [Mon May 25 05:29:33 2015] flags: 0x20047580004000(head) [Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page)) [Mon May 25 05:29:33 2015] ------------[ cut here ]------------ [Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661! [Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP In this case it was triggered by running tcpdump but it's not necessary reproducible on all systems. sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap Compound pages cannot be migrated and it was not expected that such pages be marked for NUMA balancing. This did not take into account that drivers such as net/packet/af_packet.c may insert compound pages into userspace with vm_insert_page. This patch tells the NUMA balancing protection scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that compound pages are marked for migration. Signed-off-by: Mel Gorman Reported-by: Jovi Zhangwei Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ffeaa41..c2980e8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2181,7 +2181,7 @@ void task_numa_work(struct callback_head *work) } for (; vma; vma = vma->vm_next) { if (!vma_migratable(vma) || !vma_policy_mof(vma) || - is_vm_hugetlb_page(vma)) { + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { continue; } -- cgit v0.10.2 From 5ec45a192fe6e287f0fc06d5ca4f3bd446d94803 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Wed, 10 Jun 2015 11:15:02 -0700 Subject: arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug Fix this compile issue with gcc-4.4.4: arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write': arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer ... gcc-4.4.4 (at least) has issues when using anonymous unions in initializers. Fixes: edc90b7dc4ceef6 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong Cc: Paolo Bonzini Cc: Davidlohr Bueso Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 44a7d25..b733376 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -4215,13 +4215,13 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, u64 entry, gentry, *spte; int npte; bool remote_flush, local_flush, zap_page; - union kvm_mmu_page_role mask = (union kvm_mmu_page_role) { - .cr0_wp = 1, - .cr4_pae = 1, - .nxe = 1, - .smep_andnot_wp = 1, - .smap_andnot_wp = 1, - }; + union kvm_mmu_page_role mask = { }; + + mask.cr0_wp = 1; + mask.cr4_pae = 1; + mask.nxe = 1; + mask.smep_andnot_wp = 1; + mask.smap_andnot_wp = 1; /* * If we don't have indirect shadow pages, it means no page is -- cgit v0.10.2