path: root/mm
2012-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull more networking updates from David Miller:

"Ok, everything from here on out will be bug fixes."

1) One final sync of wireless and bluetooth stuff from John Linville. These changes have all been in his tree for more than a week, and therefore have had the necessary -next exposure. John was just away on a trip and didn't have a chance to send the pull request until a day or two ago.

2) Put back some defines in user exposed header file areas that were removed during the tokenring purge. From Stephen Hemminger and Paul Gortmaker.

3) A bug fix for UDP hash table allocation got lost in the pile due to one of those "you got it.. no I've got it.." situations. :-) From Tim Bird.

4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

5) RCU routing table lookups can race with free_fib_info(), causing crashes when we deref the device pointers in the route. Fix by releasing the net device in the RCU callback. From Yanmin Zhang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits) tcp: take care of overlaps in tcp_try_coalesce() ipv4: fix the rcu race between free_fib_info and ip_route_output_slow mm: add a low limit to alloc_large_system_hash ipx: restore token ring define to include/linux/ipx.h if: restore token ring ARP type to header xen: do not disable netfront in dom0 phy/micrel: Fix ID of KSZ9021 mISDN: Add X-Tensions USB ISDN TA XC-525 gianfar:don't add FCB length to hard_header_len Bluetooth: Report proper error number in disconnection Bluetooth: Create flags for bt_sk() Bluetooth: report the right security level in getsockopt Bluetooth: Lock the L2CAP channel when sending Bluetooth: Restore locking semantics when looking up L2CAP channels Bluetooth: Fix a redundant and problematic incoming MTU check Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C Bluetooth: Fix EIR data generation for mgmt_device_found Bluetooth: Fix Inquiry with RSSI event mask Bluetooth: improve readability of l2cap_seq_list code Bluetooth: Fix skb length calculation ...
2012-05-24Merge branch 'perf-uprobes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull user-space probe instrumentation from Ingo Molnar:

"The uprobes code originates from SystemTap and has been used for years in Fedora and RHEL kernels. This version is much rewritten, reviews from PeterZ, Oleg and myself shaped the end result.

This tree includes uprobes support in 'perf probe' - but SystemTap (and other tools) can take advantage of user probe points as well.

Sample usage of uprobes via perf, for example to profile malloc() calls without modifying user-space binaries.

First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

If you don't know which function you want to probe you can pick one from 'perf top' or can get a list of all functions that can be probed within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

Make use of it to create a call graph (as the flat profile is going to look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

      32.03%            git  libc-2.15.so   [.] malloc
                        |
                        --- malloc

      29.49%            cc1  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |
                           |--0.95%-- 0x208eb1000000000
                           |
                           |--0.63%-- htab_traverse_noresize

      11.04%             as  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |

       7.15%             ld  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |

       5.07%             sh  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |

       4.99%  python-config  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |

       4.54%           make  libc-2.15.so   [.] malloc
                        |
                        --- malloc
                           |
                           |--7.34%-- glob
                           |          |
                           |          |--93.18%-- 0x41588f
                           |          |
                           |           --6.82%-- glob
                           |                     0x41588f
       ...

Or:

    $ perf report -g flat | less

    # Overhead        Command  Shared Object      Symbol
    # ........  .............  .............  ..........
    #
      32.03%            git  libc-2.15.so   [.] malloc
              27.19%
                  malloc

      29.49%            cc1  libc-2.15.so   [.] malloc
              24.77%
                  malloc

      11.04%             as  libc-2.15.so   [.] malloc
              11.02%
                  malloc

       7.15%             ld  libc-2.15.so   [.] malloc
               6.57%
                  malloc
      ...

The core uprobes design is fairly straightforward: uprobes probe points register themselves at (inode:offset) addresses of libraries/binaries, after which all existing (or new) vmas that map that address will have a software breakpoint injected at that address. vmas are COW-ed to preserve original content. The probe points are kept in an rbtree.

If user-space executes the probed inode:offset instruction address then an event is generated which can be recovered from the regular perf event channels and mmap-ed ring-buffer.

Multiple probes at the same address are supported, they create a dynamic callback list of event consumers.

The basic model is further complicated by the XOL speedup: the original instruction that is probed is copied (in an architecture specific fashion) and executed out of line when the probe triggers. The XOL area is a single vma per process, with a fixed number of entries (which limits probe execution parallelism).

The API: uprobes are installed/removed via /sys/kernel/debug/tracing/uprobe_events, the API is integrated to align with the kprobes interface as much as possible, but is separate from it.

Injecting a probe point is a privileged operation, which can be relaxed by setting perf_paranoid to -1.

You can use multiple probes as well and mix them with kprobes and regular PMU events or tracepoints, when instrumenting a task."

Fix up trivial conflicts in mm/memory.c due to previous cleanup of unmap_single_vma().
* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) perf probe: Detect probe target when m/x options are absent perf probe: Provide perf interface for uprobes tracing: Fix kconfig warning due to a typo tracing: Provide trace events interface for uprobes tracing: Extract out common code for kprobes/uprobes trace events tracing: Modify is_delete, is_return from int to bool uprobes/core: Decrement uprobe count before the pages are unmapped uprobes/core: Make background page replacement logic account for rss_stat counters uprobes/core: Optimize probe hits with the help of a counter uprobes/core: Allocate XOL slots for uprobes use uprobes/core: Handle breakpoint and singlestep exceptions uprobes/core: Rename bkpt to swbp uprobes/core: Make order of function parameters consistent across functions uprobes/core: Make macro names consistent uprobes: Update copyright notices uprobes/core: Move insn to arch specific structure uprobes/core: Remove uprobe_opcode_sz uprobes/core: Make instruction tables volatile uprobes: Move to kernel/events/ uprobes/core: Clean up, refactor and improve the code ...
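The pull message above says probe points are registered at (inode:offset) addresses and kept in an rbtree. A minimal sketch of the ordering such a tree needs is given here; the struct `probe_key` and the function `probe_key_cmp` are hypothetical names chosen for illustration and are not taken from the uprobes code.

```c
#include <stdint.h>

/* Hypothetical key: which file (inode) and where in it (offset). */
struct probe_key {
    uintptr_t inode;   /* stand-in for a struct inode pointer */
    uint64_t  offset;  /* byte offset of the probed instruction */
};

/*
 * Total order for a search tree: compare the inode first, then the offset.
 * Returns a negative value, 0, or a positive value, like strcmp().
 */
int probe_key_cmp(const struct probe_key *l, const struct probe_key *r)
{
    if (l->inode != r->inode)
        return l->inode < r->inode ? -1 : 1;
    if (l->offset != r->offset)
        return l->offset < r->offset ? -1 : 1;
    return 0;
}
```

Two registrations that compare equal refer to the same probe point, which is where a list of several consumers would hang off a single tree node.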
2012-05-24mm: add a low limit to alloc_large_system_hashTim Bird
The UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. In some low-memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocate a bigger memory area. As we cannot easily free the old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
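The idea of a low limit can be sketched as a clamp applied to the memory-derived size estimate, so the caller never has to allocate a second, larger table. This is only an illustration: `round_up_pow2` and `hash_entries` are hypothetical helpers, not the signature of alloc_large_system_hash(), and the value 256 in the usage note is an assumed stand-in for a minimum such as UDP_HTABLE_SIZE_MIN.

```c
#include <limits.h>

/* Round n up to the next power of two, stopping before the shift overflows. */
unsigned long round_up_pow2(unsigned long n)
{
    unsigned long p = 1;

    while (p < n && p <= ULONG_MAX / 2)
        p <<= 1;
    return p;
}

/*
 * Hypothetical sizing helper: start from a memory-derived estimate and clamp
 * it to [low_limit, high_limit]. A high_limit of 0 means "no upper bound".
 */
unsigned long hash_entries(unsigned long estimate,
                           unsigned long low_limit,
                           unsigned long high_limit)
{
    unsigned long n = round_up_pow2(estimate ? estimate : 1);

    if (high_limit && n > high_limit)
        n = high_limit;
    if (n < low_limit)
        n = low_limit;   /* the new lower bound wins on small machines */
    return n;
}
```

With an assumed minimum of 256, `hash_entries(100, 256, 65536)` yields 256 directly, so there is no undersized first allocation to leak.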
2012-05-24mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkagesMel Gorman
Dave Jones' system call fuzz testing tool "trinity" triggered the following bug error with slab debugging enabled ============================================================================= BUG numa_policy (Not tainted): Poison overwritten ----------------------------------------------------------------------------- INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154 __slab_alloc+0x3d3/0x445 kmem_cache_alloc+0x29d/0x2b0 mpol_new+0xa3/0x140 sys_mbind+0x142/0x620 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154 __slab_free+0x2e/0x1de kmem_cache_free+0x25a/0x260 __mpol_put+0x27/0x30 remove_vma+0x68/0x90 exit_mmap+0x118/0x140 mmput+0x73/0x110 exit_mm+0x108/0x130 do_exit+0x162/0xb90 do_group_exit+0x4f/0xc0 sys_exit_group+0x17/0x20 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080 INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0 This implied a reference counting bug and the problem happened during mbind(). mbind() applies a new memory policy to a range and uses mbind_range() to merge existing VMAs or split them as necessary. In the event of splits, mpol_dup() will allocate a new struct mempolicy and maintain existing reference counts whose rules are documented in Documentation/vm/numa_memory_policy.txt . The problem occurs with shared memory policies. The vm_op->set_policy increments the reference count if necessary and split_vma() and vma_merge() have already handled the existing reference counts. However, policy_vma() screws it up by replacing an existing vma->vm_policy with one that potentially has the wrong reference count leading to a premature free. This patch removes the damage caused by policy_vma(). With this patch applied Dave's trinity tool runs an mbind test for 5 minutes without error. 
/proc/slabinfo reported that there are no numa_policy or shared_policy_node objects allocated after the test completed and the shared memory region was deleted. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull user namespace enhancements from Eric Biederman:

"This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation.

Highlights:

- Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe.

- Use of the new kuid_t type ensures that if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe.

- All uids from child user namespaces are mapped into the initial user namespace before they are processed, removing the need to add an additional check to see if the user namespace of the compared uids remains the same.

- With the user namespaces compiled out the performance is as good or better than it is today.

- For most operations absolutely nothing changes, in performance or in operation, with the user namespace enabled.

- The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation).

- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these values specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace.

- If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails.

- If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map.

- Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities.

My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
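The type-error guarantee described in the highlights comes from wrapping the id in a struct, so that a plain integer comparison no longer compiles and unconverted code is caught at build time. The following user-space sketch imitates that idea; `raw_uid` and `make_kuid_raw` are hypothetical names for illustration, while `kuid_t` and `uid_eq` are the names used in the messages above.

```c
#include <stdbool.h>

typedef unsigned int raw_uid;            /* stand-in for a plain uid_t */
typedef struct { raw_uid val; } kuid_t;  /* wrapped id: no longer an integer */

/* Hypothetical constructor used only in this sketch. */
kuid_t make_kuid_raw(raw_uid v)
{
    kuid_t k = { v };
    return k;
}

/* Comparison has to go through a helper; "a == b" on structs is rejected. */
bool uid_eq(kuid_t a, kuid_t b)
{
    return a.val == b.val;
}
```

Code that still writes `if (a == b)` with two `kuid_t` values fails to compile, which is the kind of type error the pull message relies on.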
2012-05-23Merge branch 'for-3.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "cgroup file type addition / removal is updated so that file types are added and removed instead of individual files so that dynamic file type addition / removal can be implemented by cgroup and used by controllers. blkio controller changes which will come through block tree are dependent on this. Other changes include res_counter cleanup and disallowing kthread / PF_THREAD_BOUND threads to be attached to non-root cgroups. There's a reported bug with the file type addition / removal handling which can lead to oops on cgroup umount. The issue is being looked into. It shouldn't cause problems for most setups and isn't a security concern." Fix up trivial conflict in Documentation/feature-removal-schedule.txt * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) res_counter: Account max_usage when calling res_counter_charge_nofail() res_counter: Merge res_counter_charge and res_counter_charge_nofail cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads cgroup: remove cgroup_subsys->populate() cgroup: get rid of populate for memcg cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg cgroup: make css->refcnt clearing on cgroup removal optional cgroup: use negative bias on css->refcnt to block css_tryget() cgroup: implement cgroup_rm_cftypes() cgroup: introduce struct cfent cgroup: relocate __d_cgrp() and __d_cft() cgroup: remove cgroup_add_file[s]() cgroup: convert memcg controller to the new cftype interface memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP cgroup: convert all non-memcg controllers to the new cftype interface cgroup: relocate cftype and cgroup_subsys definitions in controllers cgroup: merge cft_release_agent cftype array into the base files array cgroup: implement cgroup_add_cftypes() and friends cgroup: build list of all cgroups under a given cgroupfs_root cgroup: move 
cgroup_clear_directory() call out of cgroup_populate_dir() ...
2012-05-22Merge tag 'driver-core-3.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg Kroah-Hartman:

"Here's the driver core, and other driver subsystems, pull request for the 3.5-rc1 merge window.

Outside of a few minor driver core changes, we ended up with the following different subsystem and core changes as well, due to interdependencies on the driver core:

- hyperv driver updates
- drivers/memory being created and some drivers moved into it
- extcon driver subsystem created out of the old Android staging switch driver code
- dynamic debug updates
- printk rework, and /dev/kmsg changes

All of this has been tested in the linux-next releases for a few weeks with no reported problems.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed that a patch to the deleted drivers/misc/max8997-muic.c driver needs to be applied to this one.

* tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits) uio_pdrv_genirq: get irq through platform resource if not set otherwise memory: tegra{20,30}-mc: Remove empty *_remove() printk() - isolate KERN_CONT users from ordinary complete lines sysfs: get rid of some lockdep false positives Drivers: hv: util: Properly handle version negotiations.
Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp() memory: tegra{20,30}-mc: Use dev_err_ratelimited() driver core: Add dev_*_ratelimited() family Driver Core: don't oops with unregistered driver in driver_find_device() printk() - restore prefix/timestamp printing for multi-newline strings printk: add stub for prepend_timestamp() ARM: tegra30: Make MC optional in Kconfig ARM: tegra20: Make MC optional in Kconfig ARM: tegra30: MC: Remove unnecessary BUG*() ARM: tegra20: MC: Remove unnecessary BUG*() printk: correctly align __log_buf ARM: tegra30: Add Tegra Memory Controller(MC) driver ARM: tegra20: Add Tegra Memory Controller(MC) driver printk() - restore timestamp printing at console output printk() - do not merge continuation lines of different threads ...
2012-05-21Merge branch 'vm-cleanups' (unmap_vma() interface cleanup)Linus Torvalds
This series sanitizes the interface to unmap_vma(). The crazy interface annoyed me no end when I was looking at unmap_single_vma(), which we can spend quite a lot of time in (especially with loads that have a lot of small fork/exec's: shell scripts etc). Moving the nr_accounted calculations to where they belong at least clarifies things a little. I hope to come back to look at the performance of this later, but if/when I get back to it I at least don't have to see the crazy interfaces any more. * vm-cleanups: vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces vm: simplify unmap_vmas() calling convention
2012-05-19memcg,thp: fix res_counter:96 regressionHugh Dickins
Occasionally, testing memcg's move_charge_at_immigrate on rc7 shows a flurry of hundreds of warnings at kernel/res_counter.c:96, where res_counter_uncharge_locked() does WARN_ON(counter->usage < val). The first trace of each flurry implicates __mem_cgroup_cancel_charge() of mc.precharge, and an audit of mc.precharge handling points to mem_cgroup_move_charge_pte_range()'s THP handling in commit 12724850e806 ("memcg: avoid THP split in task migration"). Checking !mc.precharge is good everywhere else, when a single page is to be charged; but here the "mc.precharge -= HPAGE_PMD_NR" likely to follow, is liable to result in underflow (a lot can change since the precharge was estimated). Simply check against HPAGE_PMD_NR: there's probably a better alternative, trying precharge for more, splitting if unsuccessful; but this one-liner is safer for now - no kernel/res_counter.c:96 warnings seen in 26 hours. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
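The difference between the two checks can be shown with a small sketch. It assumes the precharge counter is an unsigned count of base pages and that a huge page covers 512 of them (2 MiB over 4 KiB pages); `HPAGE_NR` and the three function names are hypothetical and only illustrate the arithmetic.

```c
#include <stdbool.h>

#define HPAGE_NR 512UL  /* assumed: 2 MiB huge page / 4 KiB base page */

/* Old-style check: "some precharge is left". Fine when charging one page. */
bool can_move_huge_old(unsigned long precharge)
{
    return precharge != 0;
}

/* Check described in the message: enough precharge for a whole huge page. */
bool can_move_huge_new(unsigned long precharge)
{
    return precharge >= HPAGE_NR;
}

/* Consumes precharge for one huge page; wraps around if the check was too weak. */
unsigned long consume_huge(unsigned long precharge)
{
    return precharge - HPAGE_NR;
}
```

With 10 pages of precharge left, the old check passes and the subtraction wraps to a huge unsigned value; the stricter check refuses the move instead.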
2012-05-18slub: missing test for partial pages flush work in flush_all()majianpeng
I found some kernel messages such as: SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects. Pid: 6143, comm: mdadm Tainted: G O 3.4.0-rc6+ #75 Call Trace: kmem_cache_destroy+0x328/0x400 free_conf+0x2d/0xf0 [raid456] stop+0x41/0x60 [raid456] md_stop+0x1a/0x60 [md_mod] do_md_stop+0x74/0x470 [md_mod] md_ioctl+0xff/0x11f0 [md_mod] blkdev_ioctl+0xd8/0x7a0 block_ioctl+0x3b/0x40 do_vfs_ioctl+0x96/0x560 sys_ioctl+0x91/0xa0 system_call_fastpath+0x16/0x1b Then using kmemleak I found these messages: unreferenced object 0xffff8800b6db7380 (size 112): comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s) hex dump (first 32 bytes): 01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff .....N.......... ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff .........@J..... backtrace: kmemleak_alloc+0x21/0x50 kmem_cache_alloc+0xeb/0x1b0 kmem_cache_open+0x2f1/0x430 kmem_cache_create+0x158/0x320 setup_conf+0x649/0x770 [raid456] run+0x68b/0x840 [raid456] md_run+0x529/0x940 [md_mod] do_md_run+0x18/0xc0 [md_mod] md_ioctl+0xba8/0x11f0 [md_mod] blkdev_ioctl+0xd8/0x7a0 block_ioctl+0x3b/0x40 do_vfs_ioctl+0x96/0x560 sys_ioctl+0x91/0xa0 system_call_fastpath+0x16/0x1b This bug was introduced by commit a8364d5555b ("slub: only IPI CPUs that have per cpu obj to flush"), which did not include checks for per cpu partial pages being present on a cpu. Signed-off-by: majianpeng <majianpeng@gmail.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-15userns: Convert the move_pages, and migrate_pages permission checks to use ↵Eric W. Biederman
uid_eq Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15userns: Convert tmpfs to use kuid and kgid where appropriateEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-14Merge branch 'perf/uprobes' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/uprobes
2012-05-11mm: raise MemFree by reverting percpu_pagelist_fraction to 0Hugh Dickins
Why is there less MemFree than there used to be? It perturbed a test, so I've just been bisecting linux-next, and now find the offender went upstream yesterday. Commit 93278814d359 "mm: fix division by 0 in percpu_pagelist_fraction()" mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8, which leaves 1/8th of memory on percpu lists (on each cpu??); but most of us expect it to be left unset at 0 (and it's not then used as a divisor). MemTotal: 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB 8061476kB Repetitive test with percpu_pagelist_fraction 8: MemFree: 6948420kB 6237172kB 6949696kB 6840692kB 6949048kB 6862984kB Same test with percpu_pagelist_fraction back to 0: MemFree: 7945000kB 7944908kB 7948568kB 7949060kB 7948796kB 7948812kB Signed-off-by: Hugh Dickins <hughd@google.com> [ We really should fix the crazy sysctl interface too, but that's a separate thing - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge misc fixes from Andrew Morton. * emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches) MAINTAINERS: add maintainer for LED subsystem mm: nobootmem: fix sign extend problem in __free_pages_memory() drivers/leds: correct __devexit annotations memcg: free spare array to avoid memory leak namespaces, pid_ns: fix leakage on fork() failure hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow() mm: fix division by 0 in percpu_pagelist_fraction() proc/pid/pagemap: correctly report non-present ptes and holes between vmas
2012-05-10mm: nobootmem: fix sign extend problem in __free_pages_memory()Russ Anderson
Systems with 8 TBytes of memory or greater can hit a problem where only the first 8 TB of memory shows up. This is due to "int i" being smaller than "unsigned long start_aligned", causing the high bits to be dropped. The fix is to change `i' to unsigned long to match start_aligned and end_aligned. Thanks to Jack Steiner for assistance tracking this down. Signed-off-by: Russ Anderson <rja@sgi.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
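Assuming 4 KiB pages, 8 TB corresponds to page frame number 2^31, which is the first value that no longer fits in a 32-bit signed int. The sketch below imitates the two loop-variable widths with fixed-size types; `through_int` and `through_ulong` are hypothetical names, and the narrowing conversion is implementation-defined in C (it wraps on GCC and Clang, which is assumed here).

```c
#include <stdint.h>

/* Buggy shape: a 64-bit page frame number squeezed through a 32-bit int. */
uint64_t through_int(uint64_t pfn)
{
    int32_t i = (int32_t)pfn;     /* out of range above 2^31 - 1: wraps on GCC/Clang */
    return (uint64_t)(int64_t)i;  /* sign-extended when widened back to 64 bits */
}

/* Fixed shape: the loop variable is as wide as the bounds it walks. */
uint64_t through_ulong(uint64_t pfn)
{
    uint64_t i = pfn;
    return i;
}
```

Page frame numbers below 2^31 survive both paths, which fits the report that the problem only appears at 8 TB and above.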
2012-05-10memcg: free spare array to avoid memory leakSha Zhengju
When the last event is unregistered, there is no need to keep the spare array anymore. So free it to avoid memory leak. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()Chris Metcalf
Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()") added code to avoid a race condition by elevating the page refcount in hugetlb_fault() while calling hugetlb_cow(). However, one code path in hugetlb_cow() includes an assertion that the page count is 1, whereas it may now also have the value 2 in this path. The consensus is that this BUG_ON has served its purpose, so rather than extending it to cover both cases, we just remove it. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.0.29+, 3.2.16+, 3.3.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10mm: fix division by 0 in percpu_pagelist_fraction()Sasha Levin
percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as a possible error from proc_dointvec_minmax(). If any other error is returned, it would proceed to divide by zero since percpu_pagelist_fraction wasn't getting initialized at any point. For example, writing 0 bytes into the proc file would trigger the issue. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
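The failure mode is a division that runs even though parsing failed and the divisor was never set. A defensive shape for such a handler tail is sketched here; `pagelist_high` and its parameters are hypothetical and do not reproduce the actual handler or the actual patch.

```c
/*
 * Hypothetical handler tail: 'ret' is the result of parsing the written value
 * (0 on success, negative on any error), 'fraction' is the current setting.
 * Returns the per-cpu list limit, or 0 when parsing failed or the fraction
 * is unset, so the division is never reached with a zero divisor.
 */
unsigned long pagelist_high(int ret, unsigned long zone_pages,
                            unsigned long fraction)
{
    if (ret != 0 || fraction == 0)
        return 0;
    return zone_pages / fraction;
}
```

Treating every non-zero return as an error, rather than one specific error code, is what closes the gap described in the message.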
2012-05-09kmemleak: Fix the kmemleak tracking of the percpu areas with !SMPCatalin Marinas
Kmemleak tracks the percpu allocations via a specific API and the originally allocated areas must be removed from kmemleak (via kmemleak_free). The code was already doing this for SMP systems. Reported-by: Sami Liedes <sami.liedes@iki.fi> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-09percpu: pcpu_embed_first_chunk() should free unused parts after all allocs ↵Tejun Heo
are complete pcpu_embed_first_chunk() allocates memory for each node, copies percpu data and frees unused portions of it before proceeding to the next group. This assumes that allocations for different nodes don't overlap; however, depending on memory topology, the bootmem allocator may end up allocating memory from a different node than the requested one which may overlap with the portion freed from one of the previous percpu areas. This leads to percpu groups for different nodes overlapping which is a serious bug. This patch separates out copy & partial free from the allocation loop such that all allocations are complete before partial frees happen. This also fixes overlapping frees which could happen on the allocation failure path - the out_free_areas path frees whole groups but the groups could have portions freed at that point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru> Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru> LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
2012-05-08Merge branch 'for-3.4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull two percpu fixes from Tejun Heo: "One adds missing KERN_CONT on split printk()s and the other makes the percpu allocator avoid using PMD_SIZE as atom_size on x86_32. Using PMD_SIZE led to vmalloc area exhaustion on certain configurations (x86_32 android) and the only cost of using PAGE_SIZE instead is static percpu area not being aligned to large page mapping." * 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit percpu: use KERN_CONT in pcpu_dump_alloc_info()
2012-05-08mm: use KERN_CONT in printk() continuation linesKay Sievers
Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-06vm: remove 'nr_accounted' calculations from the unmap_vmas() interfacesLinus Torvalds
The VM accounting makes no sense at this level, and half of the callers didn't ever actually use the end result. The only time we want to unaccount the memory is when we actually remove the vma, so do the accounting at that point instead. This simplifies the interfaces (no need to pass down that silly page counter to functions that really don't care), and also makes it much more obvious what is actually going on: we do vm_[un]acct_memory() when adding or removing the vma, not on random page walking. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06vm: simplify unmap_vmas() calling conventionLinus Torvalds
None of the callers want to pass in 'zap_details', and it doesn't even make sense for the case of actually unmapping vma's. So remove the argument, and clean up the interface. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-03userns: Store uid and gid values in struct cred with kuid_t and kgid_t typesEric W. Biederman
cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace and CONFIG_UIDGID_STRICT_TYPE_CHECKS are both disabled the code will continue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-26Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge fixes from Andrew Morton: "13 fixes. The acerhdf patches aren't (really) fixes. But they've been stuck in my tree for up to two years, sent to Matthew multiple times and the developers are unhappy." * emailed from Andrew Morton <akpm@linux-foundation.org>: (13 patches) mm: fix NULL ptr dereference in move_pages mm: fix NULL ptr dereference in migrate_pages revert "proc: clear_refs: do not clear reserved pages" drivers/rtc/rtc-ds1307.c: fix BUG shown with lock debugging enabled arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file hugetlbfs: lockdep annotate root inode properly acerhdf: lowered default temp fanon/fanoff values acerhdf: add support for new hardware acerhdf: add support for Aspire 1410 BIOS v1.3314 fs/buffer.c: remove BUG() in possible but rare condition mm: fix up the vmscan stat in vmstat epoll: clear the tfile_check_list on -ELOOP mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma
2012-04-26mm: fix NULL ptr dereference in move_pagesSasha Levin
Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") added an odd construct where 'mm' is checked for being NULL and, if it is, gets dereferenced anyway by mmput()ing it. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
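The construct described above can be sketched in user-space C. This is only an illustration: struct mm_sketch, mmput_sketch() and migrate_sketch() are hypothetical stand-ins, not the kernel code, and the early return is one plausible shape of the fix rather than a copy of the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct: just a user count. */
struct mm_sketch {
	int users;
};

/* Like mmput(), this dereferences mm unconditionally. */
static void mmput_sketch(struct mm_sketch *mm)
{
	assert(mm != NULL);
	mm->users--;
}

/*
 * The problematic shape was roughly "if (!mm) goto out;" with the
 * out: label still calling mmput(mm). Returning early when there
 * is no mm avoids ever reaching the put with a NULL pointer.
 */
static int migrate_sketch(struct mm_sketch *mm)
{
	if (!mm)
		return -EINVAL;

	/* ... the real work would happen here ... */

	mmput_sketch(mm);
	return 0;
}
```

Calling migrate_sketch(NULL) returns -EINVAL without touching the pointer, while a valid mm has its user count dropped exactly once.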
2012-04-26mm: fix NULL ptr dereference in migrate_pagesSasha Levin
Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has added an odd construct where 'mm' is checked for being NULL, and if it is, it would get dereferenced anyways by mput()ing it. This would lead to the following NULL ptr deref and BUG() when calling migrate_pages() with a pid that has no mm struct: [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [25904.194235] IP: [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0 [25904.194235] Oops: 0002 [#1] PREEMPT SMP [25904.194235] CPU 2 [25904.194235] Pid: 31608, comm: trinity Tainted: G W 3.4.0-rc2-next-20120412-sasha #69 [25904.194235] RIP: 0010:[<ffffffff810b0de7>] [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] RSP: 0018:ffff880077d49e08 EFLAGS: 00010202 [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000 [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286 [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001 [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010 [25904.194235] FS: 00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000 [25904.194235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0 [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000) [25904.194235] Stack: [25904.194235] ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020 [25904.194235] ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000 [25904.194235] 00000000000003ff 0000000000000000 0000000000000000 0000000000000000 [25904.194235] Call 
Trace: [25904.194235] [<ffffffff811b8020>] sys_migrate_pages+0x340/0x3a0 [25904.194235] [<ffffffff811b7d91>] ? sys_migrate_pages+0xb1/0x3a0 [25904.194235] [<ffffffff8266cbb9>] system_call_fastpath+0x16/0x1b [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 <f0> ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1 [25904.194235] RIP [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] RSP <ffff880077d49e08> [25904.194235] CR2: 0000000000000050 [25904.348999] ---[ end trace a307b3ed40206b4b ]--- Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-26mm: fix up the vmscan stat in vmstatYing Han
The "pgsteal" stat is confusing because it counts both direct reclaim and background reclaim. However, we have "kswapd_steal", which also counts background reclaim. This patch fixes it and also makes it match the existing "pgscan_" stats. Test: pgsteal_kswapd_dma32 447623 pgsteal_kswapd_normal 42272677 pgsteal_kswapd_movable 0 pgsteal_direct_dma32 2801 pgsteal_direct_normal 44353270 pgsteal_direct_movable 0 Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-26mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vmaKonstantin Khlebnikov
Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce large amounts of memory barrier related damage v3"). Local variable "page" can be uninitialized if the nodemask from the vma policy does not intersect with the nodemask from the cpuset. Even if that doesn't happen, it is better to initialize this variable explicitly than to introduce a kernel oops in a weird corner case. mm/hugetlb.c: In function `alloc_huge_page': mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-26mm: memcg: move pc lookup point to commit_charge()Johannes Weiner
None of the callsites actually need the page_cgroup descriptor themselves, so just pass the page and do the look up in there. We already had two bugs (6568d4a 'mm: memcg: update the correct soft limit tree during migration' and 'memcg: fix Bad page state after replace_page_cache') where the passed page and pc were not referring to the same page frame. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-26mm: nobootmem: Correct alloc_bootmem semantics.David Miller
The comments above __alloc_bootmem_node() claim that the code will first try the allocation using 'goal' and if that fails it will try again but with the 'goal' requirement dropped. Unfortunately, this is not what the code does, so fix it to do so. This is important for nobootmem conversions to architectures such as sparc where MAX_DMA_ADDRESS is infinity. On such architectures all of the allocations done by generic spots, such as the sparse-vmemmap implementation, will pass in: __pa(MAX_DMA_ADDRESS) as the goal, and with the limit given as "-1" this will always fail unless we add the appropriate fallback logic here. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
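The fallback described above (try with 'goal', then retry with the goal dropped) can be sketched as follows. The arena, the names try_alloc_sketch() and alloc_with_fallback_sketch(), and the failure value are all assumptions made for illustration; the real nobootmem code is structured differently.

```c
#include <stdint.h>

#define ARENA_END_SKETCH  1024u       /* hypothetical end of usable memory */
#define ALLOC_FAIL_SKETCH UINT64_MAX  /* hypothetical failure marker */

static uint64_t cursor_sketch;        /* next free address in the toy arena */

/* Allocate size bytes at or above goal and below limit, or fail. */
static uint64_t try_alloc_sketch(uint64_t size, uint64_t goal, uint64_t limit)
{
	uint64_t start = goal > cursor_sketch ? goal : cursor_sketch;
	uint64_t end = limit < ARENA_END_SKETCH ? limit : ARENA_END_SKETCH;

	if (start > end || size > end - start)
		return ALLOC_FAIL_SKETCH;

	cursor_sketch = start + size;
	return start;
}

/* First honour the goal; if that fails, retry once with no goal. */
static uint64_t alloc_with_fallback_sketch(uint64_t size, uint64_t goal,
					   uint64_t limit)
{
	uint64_t addr = try_alloc_sketch(size, goal, limit);

	if (addr == ALLOC_FAIL_SKETCH && goal != 0)
		addr = try_alloc_sketch(size, 0, limit);

	return addr;
}
```

With a goal far beyond the end of the arena (the "MAX_DMA_ADDRESS is infinity" case), the first attempt fails and the second one succeeds from the lowest free address.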
2012-04-24mm: fix s390 BUG by __set_page_dirty_no_writeback on swapHugh Dickins
Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to transfer dirty flag from s390 storage key to struct page and radix_tree. That would be because of reclaim's shrink_page_list() calling add_to_swap() on this page at the same time: first PageSwapCache is set (causing page_mapping(page) to appear as &swapper_space), then page->private set, then tree_lock taken, then page inserted into radix_tree - so there's an interval before taking the lock when the radix_tree slot is empty. We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up before the SetPageSwapCache. But a better fix is simply to do what's five years overdue: Ken Chen introduced __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree overhead, and swap is just the same - it ignores the radix_tree tag, and does not participate in dirty page accounting, so should be using __set_page_dirty_no_writeback() too. s390 testing now confirms that this does indeed fix the problem. Reported-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ken Chen <kenchen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-21kill mm argument of vm_munmap()Al Viro
it's always current->mm Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-21VM: add "vm_mmap()" helper functionLinus Torvalds
This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-21VM: add "vm_munmap()" helper functionLinus Torvalds
Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-21VM: add "vm_brk()" helper functionLinus Torvalds
It does the same thing as "do_brk()", except it handles the VM locking too. It turns out that all external callers want that anyway, so we can make do_brk() static to just mm/mmap.c while at it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
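The split between a do_*() function that expects the caller to hold the lock and a vm_*() helper that takes it can be sketched like this. Everything here is a hypothetical user-space model: the boolean stands in for mmap_sem, and the *_sketch names are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct. */
struct mm_sketch {
	bool mmap_sem_held;   /* models the write side of mmap_sem */
	unsigned long brk;
};

static void down_write_sketch(struct mm_sketch *mm)
{
	assert(!mm->mmap_sem_held);
	mm->mmap_sem_held = true;
}

static void up_write_sketch(struct mm_sketch *mm)
{
	assert(mm->mmap_sem_held);
	mm->mmap_sem_held = false;
}

/* Internal worker: the caller must already hold the lock. */
static unsigned long do_brk_sketch(struct mm_sketch *mm,
				   unsigned long addr, unsigned long len)
{
	assert(mm->mmap_sem_held);
	mm->brk = addr + len;
	return addr;
}

/* Helper for external callers: same work, but handles the locking. */
static unsigned long vm_brk_sketch(struct mm_sketch *mm,
				   unsigned long addr, unsigned long len)
{
	unsigned long ret;

	down_write_sketch(mm);
	ret = do_brk_sketch(mm, addr, len);
	up_write_sketch(mm);
	return ret;
}
```

Once every external caller goes through the locking helper, the worker can be made static to its file, which is the cleanup the commit mentions for do_brk().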
2012-04-20memblock: memblock should be able to handle zero length operationsTejun Heo
Commit 24aa07882b ("memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones") replaced x86 specific memblock operations with the generic ones; unfortunately, it lost zero length operation handling in the process making the kernel panic if somebody tries to reserve zero length area. There isn't much to be gained by being cranky to zero length operations and panicking is almost the worst response. Drop the BUG_ON() in memblock_reserve() and update memblock_add_region/isolate_range() so that all zero length operations are handled as noops. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reported-by: Valere Monseur <valere.monseur@ymail.com> Bisected-by: Joseph Freeman <jfree143dev@gmail.com> Tested-by: Joseph Freeman <jfree143dev@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
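Treating a zero length operation as a noop instead of a fatal error can be sketched as below. The region table and reserve_sketch() are hypothetical and far simpler than memblock itself (no merging, no sorting); only the early return on size == 0 mirrors the idea in the commit message.

```c
#include <stdint.h>

#define MAX_REGIONS_SKETCH 8

struct region_sketch {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical, much simplified stand-in for a memblock region table. */
struct region_table_sketch {
	struct region_sketch regions[MAX_REGIONS_SKETCH];
	int cnt;
};

/* Reserve [base, base + size). A zero length request is a noop. */
static int reserve_sketch(struct region_table_sketch *t,
			  uint64_t base, uint64_t size)
{
	if (size == 0)
		return 0;       /* nothing to do, and no reason to panic */

	if (t->cnt == MAX_REGIONS_SKETCH)
		return -1;      /* table full */

	t->regions[t->cnt].base = base;
	t->regions[t->cnt].size = size;
	t->cnt++;
	return 0;
}
```

A zero length reservation leaves the table untouched and still reports success, so callers that compute an empty range do not need a special case.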
2012-04-19memcg: fix Bad page state after replace_page_cacheHugh Dickins
My 9ce70c0240d0 "memcg: fix deadlock by inverting lrucare nesting" put a nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(), sometimes used for FUSE. Replacing __mem_cgroup_commit_charge_lrucare() by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier: but it's for oldpage, and needs now to be for newpage. Once oldpage was freed, its PageCgroupUsed bit (cleared above but set again here) caused "Bad page state" messages - and perhaps worse, being missed from newpage. (I didn't find this by using FUSE, but in reusing the function for tmpfs.) Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org [v3.3 only] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-14uprobes/core: Decrement uprobe count before the pages are unmappedSrikar Dronamraju
Uprobes has a callback (uprobe_munmap()) in the unmap path to maintain the uprobes count. In the exit path this callback gets called in unlink_file_vma(). However by the time unlink_file_vma() is called, the pages would have been unmapped (in unmap_vmas()) and the task->rss_stat counts accounted (in zap_pte_range()). If the exiting process has probepoints, uprobe_munmap() checks if the breakpoint instruction was around before decrementing the probe count. This results in a file backed page being reread by uprobe_munmap() and hence it does not find the breakpoint. This patch fixes this problem by moving the callback to unmap_single_vma(). Since unmap_single_vma() may not unmap the complete vma, add start and end parameters to uprobe_munmap(). This bug became apparent courtesy of commit c3f0327f8e9d ("mm: add rss counters consistency check"). Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com> Cc: Linux-mm <linux-mm@kvack.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Anton Arapov <anton@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-14Merge branch 'perf/core' into perf/uprobesIngo Molnar
Merge in latest upstream (and the latest perf development tree), to prepare for tooling changes, and also to pick up v3.4 MM changes that the uprobes code needs to take care of. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-12Revert "mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone()"Ying Han
This reverts commit c38446cc65e1f2b3eb8630c53943b94c4f65f670. Before the commit, the code makes sense to me, but not after the commit. The "nr_reclaimed" is the number of pages reclaimed by scanning through the memcg's lru lists. The "nr_to_reclaim" is the target value for the whole function. For example, we like to break out of the reclaim early once 32 pages have been reclaimed under direct reclaim (not DEF_PRIORITY). After the reverted commit, the target "nr_to_reclaim" is decremented each time by "nr_reclaimed", but we still compare it against "nr_reclaimed". It just doesn't make sense to me... Signed-off-by: Ying Han <yinghan@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
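The objection above can be made concrete with a toy loop. This is not the shrink_mem_cgroup_zone() code: both functions are hypothetical, assume a fixed positive batch of pages reclaimed per iteration, and only model the comparison between a cumulative "nr_reclaimed" and a target that is either fixed or repeatedly decremented.

```c
/* Cumulative count compared against a fixed target (batch must be > 0). */
static unsigned long reclaim_fixed_target_sketch(unsigned long nr_to_reclaim,
						 unsigned long batch)
{
	unsigned long nr_reclaimed = 0;

	while (nr_reclaimed < nr_to_reclaim)
		nr_reclaimed += batch;

	return nr_reclaimed;
}

/*
 * Cumulative count compared against a target that is itself reduced
 * by the cumulative count on every pass, so the same pages are
 * subtracted more than once and the loop can stop short.
 */
static unsigned long reclaim_shrinking_target_sketch(unsigned long nr_to_reclaim,
						     unsigned long batch)
{
	unsigned long nr_reclaimed = 0;

	for (;;) {
		nr_reclaimed += batch;
		if (nr_reclaimed >= nr_to_reclaim)
			break;
		nr_to_reclaim -= nr_reclaimed;
	}

	return nr_reclaimed;
}
```

With a target of 32 pages and batches of 10, the fixed-target loop stops at 40 pages while the shrinking-target loop already stops at 30, below the original target.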
2012-04-12hugetlb: fix race condition in hugetlb_fault()Chris Metcalf
The race is as follows: Suppose a multi-threaded task forks a new process (on cpu A), thus bumping up the ref count on all the pages. While the fork is occurring (and thus we have marked all the PTEs as read-only), another thread in the original process (on cpu B) tries to write to a huge page, taking an access violation from the write-protect and calling hugetlb_cow(). Now, suppose the fork() fails. It will undo the COW and decrement the ref count on the pages, so the ref count on the huge page drops back to 1. Meanwhile hugetlb_cow() also decrements the ref count by one on the original page, since the original address space doesn't need it any more, having copied a new page to replace the original page. This leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                           fault on CPU B
    =============                           ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
        if error
            break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);            ...
                                            down_read(&parent->mmap_sem);
                                            ...
                                            lock_page(page);
                                            handle COW
                                            page_mapcount(old_page) == 2
                                            alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
    ...
                                            fold new_page into pte
                                            page_remove_rmap(page);
                                            put_page(page);
                                            ...
                                oops ==>    unlock_page(page);
                                            up_read(&parent->mmap_sem);

The solution is to take an extra reference to the page while we are holding the lock on it. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
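The effect of the extra reference can be sketched with a single-threaded model that replays the two put_page() calls in the order given above. The page structure and all *_sketch helpers are hypothetical; real huge page handling involves far more state than a counter and two flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	int refcount;
	bool locked;
	bool freed;
};

static void get_page_sketch(struct page_sketch *p)
{
	p->refcount++;
}

static void put_page_sketch(struct page_sketch *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		p->freed = true;   /* last reference gone: page is released */
}

/*
 * Replays the sequence: the fault path locks the page, the failed
 * fork drops its reference, the COW path drops the old page, and
 * only then is the page unlocked. Returns whether the page was
 * still alive at the point of unlock. With pin set, an extra
 * reference is held for as long as the page is locked.
 */
static bool page_alive_at_unlock_sketch(struct page_sketch *p, bool pin)
{
	bool alive;

	p->locked = true;
	if (pin)
		get_page_sketch(p);

	put_page_sketch(p);        /* fork error path on CPU A */
	put_page_sketch(p);        /* COW on CPU B replaces the old page */

	alive = !p->freed;
	p->locked = false;         /* unlock_page() touches the page here */

	if (pin)
		put_page_sketch(p);    /* drop the extra reference afterwards */
	return alive;
}
```

Starting from a ref count of 2, the unpinned sequence frees the page before it is unlocked, while the pinned sequence keeps it alive until after the unlock.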
2012-04-12memcg: do not open code accesses to res_counter membersGlauber Costa
We should use the accessor res_counter_read_u64 for that. Although a purely cosmetic change is sometimes better delayed, to avoid conflicting with other people's work, we are starting to have people touching this code as well, and reproducing the open code behavior because that's the standard =) Time to fix it, then. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-12memcg: fix broken boolen expressionKirill A. Shutemov
action != CPU_DEAD || action != CPU_DEAD_FROZEN is always true. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
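Why the expression is always true can be checked directly: no value equals both constants at once, so at least one of the two inequalities always holds. In the sketch below the constant values and function names are illustrative assumptions, and using && is the natural correction rather than a quote of the actual patch.

```c
#include <stdbool.h>

/* Illustrative values only; the names mirror the kernel constants. */
enum {
	CPU_DEAD_SKETCH        = 0x07,
	CPU_DEAD_FROZEN_SKETCH = 0x17,
};

/* Broken: true for every action, so the "not interested" path always runs. */
static bool not_dead_broken_sketch(unsigned long action)
{
	return action != CPU_DEAD_SKETCH || action != CPU_DEAD_FROZEN_SKETCH;
}

/* Intended: true only when action is neither of the two events. */
static bool not_dead_fixed_sketch(unsigned long action)
{
	return action != CPU_DEAD_SKETCH && action != CPU_DEAD_FROZEN_SKETCH;
}
```

By De Morgan's law the broken form is the negation of "action == CPU_DEAD && action == CPU_DEAD_FROZEN", which can never be satisfied.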
2012-04-10cgroup: get rid of populate for memcgGlauber Costa
The last man standing justifying the need for populate() is the sock memcg initialization functions. Now that we are able to pass a struct mem_cgroup instead of a struct cgroup to the socket initialization, there is nothing that stops us from initializing everything in create(). Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> CC: Li Zefan <lizefan@huawei.com> CC: Johannes Weiner <hannes@cmpxchg.org> CC: Michal Hocko <mhocko@suse.cz>
2012-04-10cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcgGlauber Costa
The only reason cgroup was used was to be consistent with the populate() interface. Now that we're getting rid of it, not only do we no longer need it, but we also *can't* call it this way. Since we will no longer rely on populate(), this will be called from create(). During create, the association between struct mem_cgroup and struct cgroup does not yet exist, since the cgroup internals haven't yet initialized their bookkeeping. This means we would not be able to draw the memcg pointer from the cgroup pointer in these functions, which is highly undesirable. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> CC: Li Zefan <lizefan@huawei.com> CC: Johannes Weiner <hannes@cmpxchg.org> CC: Michal Hocko <mhocko@suse.cz>
2012-04-01cgroup: make css->refcnt clearing on cgroup removal optionalTejun Heo
Currently, cgroup removal tries to drain all css references. If there are active css references, the removal logic waits and retries ->pre_destroy() until either all refs drop to zero or removal is cancelled. These semantics are unusual and add non-trivial complexity to cgroup core, and IMHO are fundamentally misguided in that they couple internal implementation details (references to internal data structure) with an externally visible operation (rmdir). To userland, this is a behavior peculiarity which is unnecessary and difficult to expect (css refs are otherwise invisible from userland), and, to policy implementations, this is an unnecessary restriction (e.g. blkcg wants to hold css refs for caching purposes but can't, as that becomes visible as an rmdir hang). Unfortunately, memcg currently depends on ->pre_destroy() retrials and cgroup removal vetoing and can't be immediately switched to the new behavior. This patch introduces the new behavior of not waiting for css refs to drain and maintains the old behavior for subsystems which have __DEPRECATED_clear_css_refs set. Once memcg is updated, we can drop the code paths for the old behavior as proposed in the following patch. Note that the following patch is incorrect in that the dput work item is in cgroup and may lose some dputs when multiple css's are released back-to-back, and __css_put() triggers check_for_release() when refcnt reaches 0 instead of 1; however, it shows what part can be removed. http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 Note that, in the not-too-distant future, cgroup core will start emitting warning messages for subsystems which require the old behavior, so please get moving. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01cgroup: convert memcg controller to the new cftype interfaceTejun Heo
Convert memcg to use the new cftype based interface. kmem support abuses ->populate() for mem_cgroup_sockets_init() so it can't be removed at the moment. tcp_memcontrol is updated so that tcp_files[] is registered via a __initcall. This change also allows removing the forward declaration of tcp_files[]. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Greg Thelen <gthelen@google.com>