summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2013-09-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...
2013-09-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
2013-09-05Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Gleb Natapov: "The highlights of the release are nested EPT and pv-ticketlocks support (hypervisor part, guest part, which is most of the code, goes through tip tree). Apart of that there are many fixes for all arches" Fix up semantic conflicts as discussed in the pull request thread.. * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits) ARM: KVM: Add newlines to panic strings ARM: KVM: Work around older compiler bug ARM: KVM: Simplify tracepoint text ARM: KVM: Fix kvm_set_pte assignment ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256 ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs ARM: KVM: vgic: fix GICD_ICFGRn access ARM: KVM: vgic: simplify vgic_get_target_reg KVM: MMU: remove unused parameter KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate() KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX KVM: x86: update masterclock when kvmclock_offset is calculated (v2) KVM: PPC: Book3S: Fix compile error in XICS emulation KVM: PPC: Book3S PR: return appropriate error when allocation fails arch: powerpc: kvm: add signed type cast for comparation KVM: x86: add comments where MMIO does not return to the emulator KVM: vmx: count exits to userspace during invalid guest emulation KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp kvm: optimize away THP checks in kvm_is_mmio_pfn() ...
2013-09-05Merge tag 'PTR_RET-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull PTR_RET() removal patches from Rusty Russell: "PTR_RET() is a weird name, and led to some confusing usage. We ended up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages. This has been sitting in linux-next for a whole cycle" [ There are still some PTR_RET users scattered about, with some of them possibly being new, but most of them existing in Rusty's tree too. We have that #define PTR_RET(p) PTR_ERR_OR_ZERO(p) thing in <linux/err.h>, so they continue to work for now - Linus ] * tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). staging/zcache: don't use PTR_RET(). remoteproc: don't use PTR_RET(). pinctrl: don't use PTR_RET(). acpi: Replace weird use of PTR_RET. s390: Replace weird use of PTR_RET. PTR_RET is now PTR_ERR_OR_ZERO(): Replace most. PTR_RET is now PTR_ERR_OR_ZERO
2013-09-04Merge branch 'x86-ras-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS changes from Ingo Molnar: "[ The reason for drivers/ updates is that Boris asked for the drivers/edac/ changes to go via x86/ras in this cycle ] Main changes: - AMD CPUs: . Add ECC event decoding support for new F15h models . Various erratum fixes . Fix single-channel on dual-channel-controllers bug. - Intel CPUs: . UC uncorrectable memory error parsing fix . Add support for CMC (Corrected Machine Check) 'FF' (Firmware First) flag in the APEI HEST - Various cleanups and fixes" * 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: amd64_edac: Fix incorrect wraparounds amd64_edac: Correct erratum 505 range cpc925_edac: Use proper array termination x86/mce, acpi/apei: Only disable banks listed in HEST if mce is configured amd64_edac: Get rid of boot_cpu_data accesses amd64_edac: Add ECC decoding support for newer F15h models x86, amd_nb: Clarify F15h, model 30h GART and L3 support pci_ids: Add PCI device ID functions 3 and 4 for newer F15h models. x38_edac: Make a local function static i3200_edac: Make a local function static x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errors APEI/ERST: Fix error message formatting amd64_edac: Fix single-channel setups EDAC: Replace strict_strtol() with kstrtol() mce: acpi/apei: Soft-offline a page on firmware GHES notification mce: acpi/apei: Add a boot option to disable ff mode for corrected errors mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC
2013-09-04direct-io: Handle O_(D)SYNC AIOChristoph Hellwig
Call generic_write_sync() from the deferred I/O completion handler if O_DSYNC is set for a write request. Also make sure various callers don't call generic_write_sync if the direct I/O code returns -EIOCBQUEUED. Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04shm_mnt is as longterm as it gets, TYVM...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04Merge branch 'for-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
2013-09-03Merge tag 'driver-core-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers" * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits) firmware loader: fix pending_fw_head list corruption drivers/base/memory.c: introduce help macro to_memory_block dynamic debug: line queries failing due to uninitialized local variable sysfs: sysfs_create_groups returns a value. debugfs: provide debugfs_create_x64() when disabled rbd: convert bus code to use bus_groups firmware: dcdbas: use binary attribute groups sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled driver core: add #include <linux/sysfs.h> to core files. HID: convert bus code to use dev_groups Input: serio: convert bus code to use drv_groups Input: gameport: convert bus code to use drv_groups driver core: firmware: use __ATTR_RW() driver core: core: use DEVICE_ATTR_RO driver core: bus: use DRIVER_ATTR_WO() driver core: create write-only attribute macros for devices and drivers sysfs: create __ATTR_WO() driver-core: platform: convert bus code to use dev_groups workqueue: convert bus code to use dev_groups MEI: convert bus code to use dev_groups ...
2013-09-03Merge tag 'char-misc-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc patches from Greg KH: "Here is the big char/misc driver pull request for 3.12-rc1 Lots of driver updates all over the char/misc tree, full details in the shortlog" * tag 'char-misc-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (62 commits) drivers: uio: Kconfig: add MMU dependancy for UIO drivers: uio: Add driver for Humusoft MF624 DAQ PCI card drivers: uio_pdrv_genirq: use dev_get_platdata() drivers: uio_pruss: use dev_get_platdata() drivers: uio_dmem_genirq: use dev_get_platdata() drivers: parport: Kconfig: exclude h8300 for PARPORT_PC drivers: misc: ti-st: fix potential race if st_kim_start fails Drivers: hv: vmbus: Do not attempt to negoatiate a new version prematurely misc: vmw_balloon: Remove braces to fix build for clang. Drivers: hv: vmbus: Fix a bug in the handling of channel offers vme: vme_ca91cx42.c: fix to pass correct device identity to free_irq() VMCI: Add support for virtual IOMMU VMCI: Remove non-blocking/pinned queuepair support uio: uio_pruss: remove unnecessary platform_set_drvdata() parport: amiga: remove unnecessary platform_set_drvdata() vme: vme_vmivme7805.c: add missing __iomem annotation vme: vme_ca91cx42.c: add missing __iomem annotation vme: vme_tsi148.c: add missing __iomem annotation drivers/misc/hpilo: Correct panic when an AUX iLO is detected uio: drop unused vma_count member in uio_device struct ...
2013-09-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull first batch of s390 updates from Martin Schwidefsky: "The most interesting change is that Martin converted s390 to generic hardirqs. Which means that all current architectures have been converted and that CONFIG_GENERIC_HARDIRQS can be removed. Martin prepared a patch for that already (see genirq branch), but the best time to merge that is probably at the end of the merge window / begin of -rc1. Another patch converts s390 to software referenced bits instead of relying on the reference bit in the storage key. Therefore s390 doesn't use storage keys anymore, except for kvm. Besides that we have improvements, cleanups and fixes in PCI, DASD and all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (32 commits) s390/pci: use virtual memory for iommu bitmap s390/cio: fix unlocked access of global bitmap s390/pci: update function handle after resume from hibernate s390/pci: try harder to modify a function s390/pci: split lpf s390/hibernate: add early resume function s390/pci: add recover sysfs knob s390/pci: use claim_resource s390/pci/hotplug: convert to be builtin only s390/mm: implement software referenced bits s390/dasd: fix statistics for recovered requests s390/tx: allow program interruption filtering in user space s390/pgtable: fix mprotect for single-threaded KVM guests s390/time: return with irqs disabled from psw_idle s390/kprobes: add support for compare and branch instructions s390/switch_to: fix save_access_regs() / restore_access_regs() s390/bitops: fix inline assembly constraints s390/dasd: enable raw_track_access reads without direct I/O s390/mm: introduce ptep_flush_lazy helper s390/time: clock comparator revalidation ...
2013-08-29s390/mm: implement software referenced bitsMartin Schwidefsky
The last remaining use for the storage key of the s390 architecture is reference counting. The alternative is to make page table entries invalid while they are old. On access the fault handler marks the pte/pmd as young which makes the pte/pmd valid if the access rights allow read access. The pte/pmd invalidations required for software managed reference bits cost a bit of performance, on the other hand the RRBE/RRBM instructions to read and reset the referenced bits are quite expensive as well. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-08-29memcg: check that kmem_cache has memcg_params before accessing itAndrey Vagin
If the system had a few memory groups and all of them were destroyed, memcg_limited_groups_array_size has non-zero value, but all new caches are created without memcg_params, because memcg_kmem_enabled() returns false. We try to enumirate child caches in a few places and all of them are potentially dangerous. For example my kernel is compiled with CONFIG_SLAB and it crashed when I tryed to mount a NFS share after a few experiments with kmemcg. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0 PGD b942a067 PUD b999f067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000 RIP: 0010:[<ffffffff8118166a>] [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0 RSP: 0018:ffff8800ba32fb70 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006 RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246 RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010 R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200 FS: 00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0 Call Trace: enable_cpucache+0x49/0x100 setup_cpu_cache+0x215/0x280 __kmem_cache_create+0x2fa/0x450 kmem_cache_create_memcg+0x214/0x350 kmem_cache_create+0x2b/0x30 fscache_init+0x19b/0x230 [fscache] do_one_initcall+0xfa/0x1b0 load_module+0x1c41/0x26d0 SyS_finit_module+0x86/0xb0 system_call_fastpath+0x16/0x1b Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28Merge remote-tracking branch 'origin/next' into kvm-ppc-nextAlexander Graf
Conflicts: mm/Kconfig CMA DMA split and ZSWAP introduction were conflicting, fix up manually.
2013-08-27mm: move_ptes -- Set soft dirty bit depending on pte typeCyrill Gorcunov
Dave reported corrupted swap entries | [ 4588.541886] swap_free: Unused swap offset entry 00002d15 | [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067 and Hugh pointed that in move_ptes _PAGE_SOFT_DIRTY bit set regardless the type of entry pte consists of. The trick here is that when we carry soft dirty status in swap entries we are to use _PAGE_SWP_SOFT_DIRTY instead, because this is the only place in pte which can be used for own needs without intersecting with bits owned by swap entry type/offset. Reported-and-tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Analyzed-by: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-27Fix comment typo for init_cma_reserved_pageblockLi Zhong
It seems the "it's" should be "its" here. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-27treewide: Fix printks with 0x%#Joe Perches
Using 0x%# emits 0x0x. Only one is necessary. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes from the last week or so" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: collect_mounts() should return an ERR_PTR bfs: iget_locked() doesn't return an ERR_PTR efs: iget_locked() doesn't return an ERR_PTR() proc: kill the extra proc_readfd_common()->dir_emit_dots() cope with potentially long ->d_dname() output for shmem/hugetlb
2013-08-24cope with potentially long ->d_dname() output for shmem/hugetlbAl Viro
dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-23memcg: get rid of swapaccount leftoversMichal Hocko
The swapaccount kernel parameter without any values has been removed by commit a2c8990aed5a ("memsw: remove noswapaccount kernel parameter") but it seems that we didn't get rid of all the left overs. Make sure that menuconfig help text and kernel-parameters.txt are clear about value for the paramter and remove the stalled comment which is not very much useful on its own. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Gergely Risko <gergely@risko.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-20page_isolation: Fix a comment typo in test_pages_isolated()Tang Chen
pageblock_nr_page should be pageblock_nr_pages, and fist is a typo of first. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20backing-dev: convert class code to use dev_groupsGreg Kroah-Hartman
The dev_attrs field of struct class is going away soon, dev_groups should be used instead. This converts the backing device class code to use the correct field. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-19Merge 3.11-rc6 into char-misc-nextGreg Kroah-Hartman
We want these fixes in this tree. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-16Fix TLB gather virtual address range invalidation corner casesLinus Torvalds
Ben Tebulin reported: "Since v3.7.2 on two independent machines a very specific Git repository fails in 9/10 cases on git-fsck due to an SHA1/memory failures. This only occurs on a very specific repository and can be reproduced stably on two independent laptops. Git mailing list ran out of ideas and for me this looks like some very exotic kernel issue" and bisected the failure to the backport of commit 53a59fc67f97 ("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT"). That commit itself is not actually buggy, but what it does is to make it much more likely to hit the partial TLB invalidation case, since it introduces a new case in tlb_next_batch() that previously only ever happened when running out of memory. The real bug is that the TLB gather virtual memory range setup is subtly buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather: enable tlb flush range in generic mmu_gather"), and the range handling was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB range flushed when __tlb_remove_page() runs out of slots"), but that fix was not complete. The problem with the TLB gather virtual address range is that it isn't set up by the initial tlb_gather_mmu() initialization (which didn't get the TLB range information), but it is set up ad-hoc later by the functions that actually flush the TLB. And so any such case that forgot to update the TLB range entries would potentially miss TLB invalidates. Rather than try to figure out exactly which particular ad-hoc range setup was missing (I personally suspect it's the hugetlb case in zap_huge_pmd(), which didn't have the same logic as zap_pte_range() did), this patch just gets rid of the problem at the source: make the TLB range information available to tlb_gather_mmu(), and initialize it when initializing all the other tlb gather fields. This makes the patch larger, but conceptually much simpler. And the end result is much more understandable; even if you want to play games with partial ranges when invalidating the TLB contents in chunks, now the range information is always there, and anybody who doesn't want to bother with it won't introduce subtle bugs. Ben verified that this fixes his problem. Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com> Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au> Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-14Merge tag 'amd_f15_m30' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/ras Pull AMD F15h, model 0x30 and later enablement stuff, more specifically EDAC support, from Borislav Petkov. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-14mm: save soft-dirty bits on file pagesCyrill Gorcunov
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-14mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-14memcg: don't initialize kmem-cache destroying work for root cachesAndrey Vagin
struct memcg_cache_params has a union. Different parts of this union are used for root and non-root caches. A part with destroying work is used only for non-root caches. I fixed the same problem in another place v3.9-rc1-16204-gf101a94, but didn't notice this one. This patch fixes the kernel panic: [ 46.848187] BUG: unable to handle kernel paging request at 000000fffffffeb8 [ 46.849026] IP: [<ffffffff811a484c>] kmem_cache_destroy_memcg_children+0x6c/0xc0 [ 46.849092] PGD 0 [ 46.849092] Oops: 0000 [#1] SMP ... Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@vger.kernel.org> [3.9.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-12mm: make generic_access_phys available for modulesUwe Kleine-König
In the next commit this function will be used in the uio subsystem Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-12Merge tag 'please-pull-mce-f-bit' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull MCE-uncorrected-error fix from Tony Luck: "Bit 12 may or may not be set in MCi_STATUS.MCACOD when an uncorrected error is reported. Ignore it when checking error signatures." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-12Merge branch 'x86/mce' into x86/rasIngo Molnar
Pursue a single RAS/MCE topic branch on x86. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-09cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo
the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-09cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state ↵Tejun Heo
instead of cgroup cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cftype->[un]register_event() is among the remaining couple interfaces which still use struct cgroup. Convert it to cgroup_subsys_state. The conversion is mostly mechanical and removes the last users of mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed. v2: indentation update as suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-09cgroup: make task iterators deal with cgroup_subsys_state instead of cgroupTejun Heo
cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. This patch converts task iterators to deal with css instead of cgroup. Note that under unified hierarchy, different sets of tasks will be considered belonging to a given cgroup depending on the subsystem in question and making the iterators deal with css instead cgroup provides them with enough information about the iteration. While at it, fix several function comment formats in cpuset.c. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com>
2013-08-09cgroup: make cgroup_task_iter remember the cgroup being iteratedTejun Heo
Currently all cgroup_task_iter functions require @cgrp to be passed in, which is superflous and increases chance of usage error. Make cgroup_task_iter remember the cgroup being iterated and drop @cgrp argument from next and end functions. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-09cgroup: rename cgroup_iter to cgroup_task_iterTejun Heo
cgroup now has multiple iterators and it's quite confusing to have something which walks over tasks of a single cgroup named cgroup_iter. Let's rename it to cgroup_task_iter. While at it, reformat / update comments and replace the overview comment above the interface function decls with proper function comments. Such overview can be useful but function comments should be more than enough here. This is pure rename and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-09cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo
cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
2013-08-09cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-09cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-09cgroup: add css_parent()Tejun Heo
Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-09cgroup: add/update accessors which obtain subsys specific data from cssTejun Heo
css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-09hugetlb_cgroup: pass around @hugetlb_cgroup instead of @cgroupTejun Heo
cgroup controller API will be converted to primarily use struct cgroup_subsys_state instead of struct cgroup. In preparation, make hugetlb_cgroup functions pass around struct hugetlb_cgroup instead of struct cgroup. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
2013-08-09cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/Tejun Heo
The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-08Revert "slub: do not put a slab to cpu partial list when cpu_partial is 0"Linus Torvalds
This reverts commit 318df36e57c0ca9f2146660d41ff28e8650af423. This commit caused Steven Rostedt's hackbench runs to run out of memory due to a leak. As noted by Joonsoo Kim, it is buggy in the following scenario: "I guess, you may set 0 to all kmem caches's cpu_partial via sysfs, doesn't it? In this case, memory leak is possible in following case. Code flow of possible leak is follwing case. * in __slab_free() 1. (!new.inuse || !prior) && !was_frozen 2. !kmem_cache_debug && !prior 3. new.frozen = 1 4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1 5. with this patch, put_cpu_partial() doesn't do anything, because this cache's cpu_partial is 0 6. return In step 5, leak occur" And Steven does indeed have cpu_partial set to 0 due to RT testing. Joonsoo is cooking up a patch, but everybody agrees that reverting this for now is the right thing to do. Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-04tmpfs: fix SEEK_DATA/SEEK_HOLE regressionHugh Dickins
Commit 46a1c2c7ae53 ("vfs: export lseek_execute() to modules") broke the tmpfs SEEK_DATA/SEEK_HOLE implementation, because vfs_setpos() converts the carefully prepared -ENXIO to -EINVAL. Other filesystems avoid it in error cases: do the same in tmpfs. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31vmpressure: make sure there are no events queued after memcg is offlinedMichal Hocko
vmpressure is called synchronously from reclaim where the target_memcg is guaranteed to be alive but the eventfd is signaled from the work queue context. This means that memcg (along with vmpressure structure which is embedded into it) might go away while the work item is pending which would result in use-after-release bug. We have two possible ways how to fix this. Either vmpressure pins memcg before it schedules vmpr->work and unpin it in vmpressure_work_fn or explicitely flush the work item from the css_offline context (as suggested by Tejun). This patch implements the later one and it introduces vmpressure_cleanup which flushes the vmpressure work queue item item. It hooks into mem_cgroup_css_offline after the memcg itself is cleaned up. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31vmpressure: do not check for pending work to prevent from new workMichal Hocko
because it is racy and it doesn't give us much anyway as schedule_work handles this case already. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31vmpressure: change vmpressure::sr_lock to spinlockMichal Hocko
There is nothing that can sleep inside critical sections protected by this lock and those sections are really small so there doesn't make much sense to use mutex for them. Change the log to a spinlock Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31mm: zbud: fix condition check on allocation sizeHeesub Shin
zbud_alloc() incorrectly verifies the size of allocation limit. It should deny the allocation request greater than (PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE), not (PAGE_SIZE - ZHDR_SIZE_ALIGNED) which has no remaining spaces for its buddy. There is no point in spending the entire zbud page storing only a single page, since we don't have any benefits. Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunae Seo <sunae.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31thp, mm: avoid PageUnevictable on active/inactive lru listsKirill A. Shutemov
active/inactive lru lists can contain unevicable pages (i.e. ramfs pages that have been placed on the LRU lists when first allocated), but these pages must not have PageUnevictable set - otherwise shrink_[in]active_list goes crazy: kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122! 1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan, 1091 struct lruvec *lruvec, struct list_head *dst, 1092 unsigned long *nr_scanned, struct scan_control *sc, 1093 isolate_mode_t mode, enum lru_list lru) 1094 { ... 1108 switch (__isolate_lru_page(page, mode)) { 1109 case 0: ... 1116 case -EBUSY: ... 1121 default: 1122 BUG(); 1123 } 1124 } ... 1130 } __isolate_lru_page() returns EINVAL for PageUnevictable(page). For lru_add_page_tail(), it means we should not set PageUnevictable() for tail pages unless we're sure that it will go to LRU_UNEVICTABLE. Let's just copy PG_active and PG_unevictable from head page in __split_huge_page_refcount(), it will simplify lru_add_page_tail(). This will fix one more bug in lru_add_page_tail(): if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail will go to the same lru as page, but nobody cares to sync page_tail active/inactive state with page. So we can end up with inactive page on active lru. The patch will fix it as well since we copy PG_active from head page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>