summaryrefslogtreecommitdiff
path: root/kernel/cpuset.c
AgeCommit message (Collapse)Author
2016-03-21Merge branch 'for-4.6-ns' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup namespace support from Tejun Heo: "These are changes to implement namespace support for cgroup which has been pending for quite some time now. It is very straight-forward and only affects what part of cgroup hierarchies are visible. After unsharing, mounting a cgroup fs will be scoped to the cgroups the task belonged to at the time of unsharing and the cgroup paths exposed to userland would be adjusted accordingly" * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix and restructure error handling in copy_cgroup_ns() cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns() Add FS_USERNS_FLAG to cgroup fs cgroup: Add documentation for cgroup namespaces cgroup: mount cgroupns-root when inside non-init cgroupns kernfs: define kernfs_node_dentry cgroup: cgroup namespace setns support cgroup: introduce cgroup namespaces sched: new clone flag CLONE_NEWCGROUP for cgroup namespace kernfs: Add API to generate relative kernfs path
2016-02-23cgroup: convert cgroup_subsys flag fields to bool bitfieldsTejun Heo
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2016-02-16cgroup: introduce cgroup namespacesAditya Kali
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-01-22cpuset: make mm migration asynchronousTejun Heo
If "cpuset.memory_migrate" is set, when a process is moved from one cpuset to another with a different memory node mask, pages in used by the process are migrated to the new set of nodes. This was performed synchronously in the ->attach() callback, which is synchronized against process management. Recently, the synchronization was changed from per-process rwsem to global percpu rwsem for simplicity and optimization. Combined with the synchronous mm migration, this led to deadlocks because mm migration could schedule a work item which may in turn try to create a new worker blocking on the process management lock held from cgroup process migration path. This heavy an operation shouldn't be performed synchronously from that deep inside cgroup migration in the first place. This patch punts the actual migration to an ordered workqueue and updates cgroup process migration and cpuset config update paths to flush the workqueue after all locks are released. This way, the operations still seem synchronous to userland without entangling mm migration with process management synchronization. CPU hotplug can also invoke mm migration but there's no reason for it to wait for mm migrations and thus doesn't synchronize against their completions. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: stable@vger.kernel.org # v4.4+
2015-12-03Merge branch 'for-4.4-fixes' into for-4.5Tejun Heo
2015-12-03cgroup: fix handling of multi-destination migration from subtree_control ↵Tejun Heo
enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-11-25cpuset: Replace all instances of time_t with time64_tArnd Bergmann
The following patch replaces all instances of time_t with time64_t i.e. change the type used for representing time from 32-bit to 64-bit. All 32-bit kernels to date use a signed 32-bit time_t type, which can only represent time until January 2038. Since embedded systems running 32-bit Linux are going to survive beyond that date, we have to change all current uses, in a backwards compatible way. The patch also changes the function get_seconds() that returns a 32-bit integer to ktime_get_seconds() that returns seconds as 64-bit integer. The patch changes the type of ticks from time_t to u32. We keep ticks as 32-bits as the function uses 32-bit arithmetic which would prove less expensive than 64-bit arithmetic and the function is expected to be called atleast once every 32 seconds. Signed-off-by: Heena Sirwani <heenasirwani@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge patch-bomb from Andrew Morton: - inotify tweaks - some ocfs2 updates (many more are awaiting review) - various misc bits - kernel/watchdog.c updates - Some of mm. I have a huge number of MM patches this time and quite a lot of it is quite difficult and much will be held over to next time. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) selftests: vm: add tests for lock on fault mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage mm: introduce VM_LOCKONFAULT mm: mlock: add new mlock system call mm: mlock: refactor mlock, munlock, and munlockall code kasan: always taint kernel on report mm, slub, kasan: enable user tracking by default with KASAN=y kasan: use IS_ALIGNED in memory_is_poisoned_8() kasan: Fix a type conversion error lib: test_kasan: add some testcases kasan: update reference to kasan prototype repo kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile kasan: various fixes in documentation kasan: update log messages kasan: accurately determine the type of the bad access kasan: update reported bug types for kernel memory accesses kasan: update reported bug types for not user nor kernel memory accesses mm/kasan: prevent deadlock in kasan reporting mm/kasan: don't use kasan shadow pointer in generic functions mm/kasan: MODULE_VADDR is not available on all archs ...
2015-11-06mm, oom: remove task_lock protecting comm printingDavid Rientjes
The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-15cgroup: replace cgroup_has_tasks() with cgroup_is_populated()Tejun Heo
Currently, cgroup_has_tasks() tests whether the target cgroup has any css_set linked to it. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch replaces cgroup_has_tasks() with cgroup_is_populated() which tests cgroup->nr_populated instead which locally counts the number of populated css_sets. Unlike cgroup_has_tasks(), cgroup_is_populated() is recursive - if any of the descendants is populated, the cgroup is populated too. While this changes the meaning of the test, all the existing users are okay with the change. While at it, replace the open-coded ->populated_cnt test in cgroup_events_show() with cgroup_is_populated(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>
2015-09-22cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()Tejun Heo
It wasn't explicitly documented but, when a process is being migrated, cpuset and memcg depend on cgroup_taskset_first() returning the threadgroup leader; however, this approach is somewhat ghetto and would no longer work for the planned multi-process migration. This patch introduces explicit cgroup_taskset_for_each_leader() which iterates over only the threadgroup leaders and replaces cgroup_taskset_first() usages for accessing the leader with it. This prepares both memcg and cpuset for multi-process migration. This patch also updates the documentation for cgroup_taskset_for_each() to clarify the iteration rules and removes comments mentioning task ordering in tasksets. v2: A previous patch which added threadgroup leader test was dropped. Patch updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-22cpuset: migrate memory only for threadgroup leadersTejun Heo
If memory_migrate flag is set, cpuset migrates memory according to the destnation css's nodemask. The current implementation migrates memory whenever any thread of a process is migrated making the behavior somewhat arbitrary. Let's tie memory operations to the threadgroup leader so that memory is migrated only when the leader is migrated. While this is a behavior change, given the inherent fuziness, this change is not too likely to be noticed and allows us to clearly define who owns the memory (always the leader) and helps the planned atomic multi-process migration. Note that we're currently migrating memory in migration path proper while holding all the locks. In the long term, this should be moved out to an async work item. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>
2015-09-18cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLETejun Heo
cftype->mode allows controllers to give arbitrary permissions to interface knobs. Except for "cgroup.event_control", the existing uses are spurious. * Some explicitly specify S_IRUGO | S_IWUSR even though that's the default. * "cpuset.memory_pressure" specifies S_IRUGO while also setting a write callback which returns -EACCES. All it needs to do is simply not setting a write callback. "cgroup.event_control" uses cftype->mode to make the file world-writable. It's a misdesigned interface and we don't want controllers to be tweaking interface file permissions in general. This patch removes cftype->mode and all its spurious uses and implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is marked as compatibility-only. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-18cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl()Tejun Heo
cgroup_on_dfl() tests whether the cgroup's root is the default hierarchy; however, an individual controller is only interested in whether the controller is attached to the default hierarchy and never tests a cgroup which doesn't belong to the hierarchy that the controller is attached to. This patch replaces cgroup_on_dfl() tests in controllers with faster static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as the only user of cgroup_on_dfl() and the function is moved from the header file to cgroup.c. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>
2015-08-10cpuset: use trialcs->mems_allowed as a temp variableAlban Crequy
The comment says it's using trialcs->mems_allowed as a temp variable but it didn't match the code. Change the code to match the comment. This fixes an issue when writing in cpuset.mems when a sub-directory exists: we need to write several times for the information to persist: | root@alban:/sys/fs/cgroup/cpuset# mkdir footest9 | root@alban:/sys/fs/cgroup/cpuset# cd footest9 | root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems | | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems | | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems | 0 | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems | | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems | 0 | root@alban:/sys/fs/cgroup/cpuset/footest9# This should help to fix the following issue in Docker: https://github.com/opencontainers/runc/issues/133 In some conditions, a Docker container needs to be started twice in order to work. Signed-off-by: Alban Crequy <alban@endocode.com> Tested-by: Iago López Galeiras <iago@endocode.com> Cc: <stable@vger.kernel.org> # 3.17+ Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-04-14kernel, cpuset: remove exception for __GFP_THISNODEDavid Rientjes
Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so remove the obscure comment about it and its special-case exception. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-19cpusets, isolcpus: exclude isolcpus from load balancing in cpusetsRik van Riel
Ensure that cpus specified with the isolcpus= boot commandline option stay outside of the load balancing in the kernel scheduler. Operations like load balancing can introduce unwanted latencies, which is exactly what the isolcpus= commandline is there to prevent. Previously, simply creating a new cpuset, without even touching the cpuset.cpus field inside the new cpuset, would undo the effects of isolcpus=, by creating a scheduler domain spanning the whole system, and setting up load balancing inside that domain. The cpuset root cpuset.cpus file is read-only, so there was not even a way to undo that effect. This does not impact the majority of cpusets users, since isolcpus= is a fairly specialized feature used for realtime purposes. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: cgroups@vger.kernel.org Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: David Rientjes <rientjes@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-03-02cpuset: Fix cpuset sched_relax_domain_levelJason Low
The cpuset.sched_relax_domain_level can control how far we do immediate load balancing on a system. However, it was found on recent kernels that echo'ing a value into cpuset.sched_relax_domain_level did not reduce any immediate load balancing. The reason this occurred was because the update_domain_attr_tree() traversal did not update for the "top_cpuset". This resulted in nothing being changed when modifying the sched_relax_domain_level parameter. This patch is able to address that problem by having update_domain_attr_tree() allow updates for the root in the cpuset traversal. Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()") Cc: <stable@vger.kernel.org> # 3.9+ Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-03-02cpuset: fix a warning when clearing configured masks in old hierarchyZefan Li
When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared: # mount -t cgroup -o cpuset xxx /mnt # mkdir /mnt/tmp # echo 0 > /mnt/tmp/cpuset.cpus # echo > /mnt/tmp/cpuset.cpus # cat cpuset.cpus # cat cpuset.effective_cpus 0-15 And a kernel warning in update_cpumasks_hier() is triggered: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650() Cc: <stable@vger.kernel.org> # 3.17+ Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-03-02cpuset: initialize effective masks when clone_children is enabledZefan Li
If clone_children is enabled, effective masks won't be initialized due to the bug: # mount -t cgroup -o cpuset xxx /mnt # echo 1 > cgroup.clone_children # mkdir /mnt/tmp # cat /mnt/tmp/ # cat cpuset.effective_cpus # cat cpuset.cpus 0-15 And then this cpuset won't constrain the tasks in it. Either the bug or the fix has no effect on unified hierarchy, as there's no clone_chidren flag there any more. Reported-by: Christian Brauner <christianvanbrauner@gmail.com> Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: <stable@vger.kernel.org> # 3.17+ Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-02-14cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. * kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static buffer which is protected by a dedicated spinlock. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __initRasmus Villemoes
The only caller of cpuset_init_current_mems_allowed is the __init annotated build_all_zonelists_init, so we can also make the former __init. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com> Cc: Pintu Kumar <pintu.k@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-12Merge branch 'for-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup update from Tejun Heo: "cpuset got simplified a bit. cgroup core got a fix on unified hierarchy and grew some effective css related interfaces which will be used for blkio support for writeback IO traffic which is currently being worked on" * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: implement cgroup_get_e_css() cgroup: add cgroup_subsys->css_e_css_changed() cgroup: add cgroup_subsys->css_released() cgroup: fix the async css offline wait logic in cgroup_subtree_control_write() cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write() cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask() cpuset: lock vs unlock typo cpuset: simplify cpuset_node_allowed API cpuset: convert callback_mutex to a spinlock
2014-10-28sched/deadline: Ensure that updates to exclusive cpusets don't break ACJuri Lelli
How we deal with updates to exclusive cpusets is currently broken. As an example, suppose we have an exclusive cpuset composed of two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it up to the allowed bandwidth. If we want now to modify cpusetA's cpumask, we have to check that removing a cpu's amount of bandwidth doesn't break AC guarantees. This thing isn't checked in the current code. This patch fixes the problem above, denying an update if the new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks that are currently active. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28sched/deadline: Fix bandwidth check/update when migrating tasks between ↵Juri Lelli
exclusive cpusets Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks affinity (performing what is commonly called clustered scheduling). Unfortunately, such thing is currently broken for two reasons: - No check is performed when the user tries to attach a task to an exlusive cpuset (recall that exclusive cpusets have an associated maximum allowed bandwidth). - Bandwidths of source and destination cpusets are not correctly updated after a task is migrated between them. This patch fixes both things at once, as they are opposite faces of the same coin. The check is performed in cpuset_can_attach(), as there aren't any points of failure after that function. The updated is split in two halves. We first reserve bandwidth in the destination cpuset, after we pass the check in cpuset_can_attach(). And we then release bandwidth from the source cpuset when the task's affinity is actually changed. Even if there can be time windows when sched_setattr() may erroneously fail in the source cpuset, we are fine with it, as we can't perfom an atomic update of both cpusets at once. Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: michael@amarulasolutions.com Cc: luca.abeni@unitn.it Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: cgroups@vger.kernel.org Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-27cpuset: lock vs unlock typoDan Carpenter
This will deadlock instead of unlocking. Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-27cpuset: simplify cpuset_node_allowed APIVladimir Davydov
Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 ("cpuset: rework cpuset_zone_allowed api"). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-27cpuset: convert callback_mutex to a spinlockVladimir Davydov
The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Converting the callback_mutex into a spinlock will let us call cpuset_zone_allowed_softwall from atomic context. This, in turn, makes it possible to simplify the code by merging the hardwall and asoftwall checks into the same function, which is the business of the next patch. Suggested-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-10Merge branch 'for-3.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting. Just a handful of cleanup patches" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: remove redundant variable in cgroup_mount()" cgroup: remove redundant variable in cgroup_mount() cgroup: fix missing unlock in cgroup_release_agent() cgroup: remove CGRP_RELEASABLE flag perf/cgroup: Remove perf_put_cgroup() cgroup: remove redundant check in cgroup_ino() cpuset: simplify proc_cpuset_show() cgroup: simplify proc_cgroup_show() cgroup: use a per-cgroup work for release agent cgroup: remove bogus comments cgroup: remove redundant code in cgroup_rmdir() cgroup: remove some useless forward declarations cgroup: fix a typo in comment.
2014-09-25cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flagsZefan Li
When we change cpuset.memory_spread_{page,slab}, cpuset will flip PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset. This should be done using atomic bitops, but currently we don't, which is broken. Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened when one thread tried to clear PF_USED_MATH while at the same time another thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on the same task. Here's the full report: https://lkml.org/lkml/2014/9/19/230 To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags. v4: - updated mm/slab.c. (Fengguang Wu) - updated Documentation. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Kees Cook <keescook@chromium.org> Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time") Cc: <stable@vger.kernel.org> # 2.6.31+ Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18cpuset: simplify proc_cpuset_show()Zefan Li
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-04Merge branch 'for-3.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...
2014-07-30cpuset: fix the WARN_ON() in update_nodemasks_hier()Li Zefan
The WARN_ON() is used to check if we break the legal hierarchy, on which the effective mems should be equal to configured mems. Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Tested-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Signed-off-by: Li Zefan <lizefan@huawei.com>
2014-07-15cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypesTejun Heo
Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-07-09cpuset: export effective masks to userspaceLi Zefan
cpuset.cpus and cpuset.mems are the configured masks, and we need to export effective masks to userspace, so users know the real cpus_allowed and mems_allowed that apply to the tasks in a cpuset. v2: - export those masks unconditionally, suggested by Tejun. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: allow writing offlined masks to cpuset.cpus/memsLi Zefan
As the configured masks won't be limited by its parent, and the top cpuset's masks won't change when hotplug happens, it's natural to allow writing offlined masks to the configured masks. If on default hierarchy: # echo 0 > /sys/devices/system/cpu/cpu1/online # mkdir /cpuset/sub # echo 1 > /cpuset/sub/cpuset.cpus # cat /cpuset/sub/cpuset.cpus 1 If on legacy hierarchy: # echo 0 > /sys/devices/system/cpu/cpu1/online # mkdir /cpuset/sub # echo 1 > /cpuset/sub/cpuset.cpus -bash: echo: write error: Invalid argument Note the checks don't need to be gated by cgroup_on_dfl, because we've initialized top_cpuset.{cpus,mems}_allowed accordingly in cpuset_bind(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: enable onlined cpu/node in effective masksLi Zefan
Firstly offline cpu1: # echo 0-1 > cpuset.cpus # echo 0 > /sys/devices/system/cpu/cpu1/online # cat cpuset.cpus 0-1 # cat cpuset.effective_cpus 0 Then online it: # echo 1 > /sys/devices/system/cpu/cpu1/online # cat cpuset.cpus 0-1 # cat cpuset.effective_cpus 0-1 And cpuset will bring it back to the effective mask. The implementation is quite straightforward. Instead of calculating the offlined cpus/mems and do updates, we just set the new effective_mask to online_mask & congifured_mask. This is a behavior change for default hierarchy, so legacy hierarchy won't be affected. v2: - make refactoring of cpuset_hotplug_update_tasks() as seperate patch, suggested by Tejun. - make hotplug_update_tasks_insane() use @new_cpus and @new_mems as hotplug_update_tasks_sane() does. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: refactor cpuset_hotplug_update_tasks()Li Zefan
We mix the handling for both default hierarchy and legacy hierarchy in the same function, and it's quite messy, so split into two functions. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: make cs->{cpus, mems}_allowed as user-configured masksLi Zefan
Now we've used effective cpumasks to enforce hierarchical manner, we can use cs->{cpus,mems}_allowed as configured masks. Configured masks can be changed by writing cpuset.cpus and cpuset.mems only. The new behaviors are: - They won't be changed by hotplug anymore. - They won't be limited by its parent's masks. This ia a behavior change, but won't take effect unless mount with sane_behavior. v2: - Add comments to explain the differences between configured masks and effective masks. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: apply cs->effective_{cpus,mems}Li Zefan
Now we can use cs->effective_{cpus,mems} as effective masks. It's used whenever: - we update tasks' cpus_allowed/mems_allowed, - we want to retrieve tasks_cs(tsk)'s cpus_allowed/mems_allowed. They actually replace effective_{cpu,node}mask_cpuset(). effective_mask == configured_mask & parent effective_mask except when the reault is empty, in which case it inherits parent effective_mask. The result equals the mask computed from effective_{cpu,node}mask_cpuset(). This won't affect the original legacy hierarchy, because in this case we make sure the effective masks are always the same with user-configured masks. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: initialize top_cpuset's configured masks at mountLi Zefan
We now have to support different behaviors for default hierachy and legacy hiearchy, top_cpuset's configured masks need to be initialized accordingly. Suppose we've offlined cpu1. On default hierarchy: # mount -t cgroup -o __DEVEL__sane_behavior xxx /cpuset # cat /cpuset/cpuset.cpus 0-15 On legacy hierarchy: # mount -t cgroup xxx /cpuset # cat /cpuset/cpuset.cpus 0,2-15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: use effective cpumask to build sched domainsLi Zefan
We're going to have separate user-configured masks and effective ones. Eventually configured masks can only be changed by writing cpuset.cpus and cpuset.mems, and they won't be restricted by parent cpuset. While effective masks reflect cpu/memory hotplug and hierachical restriction, and these are the real masks that apply to the tasks in the cpuset. We calculate effective mask this way: - top cpuset's effective_mask == online_mask, otherwise - cpuset's effective_mask == configured_mask & parent effective_mask, if the result is empty, it inherits parent effective mask. Those behavior changes are for default hierarchy only. For legacy hierarchy, effective_mask and configured_mask are the same, so we won't break old interfaces. We should partition sched domains according to effective_cpus, which is the real cpulist that takes effects on tasks in the cpuset. This won't introduce behavior change. v2: - Add a comment for the call of rebuild_sched_domains(), suggested by Tejun. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes emptyLi Zefan
We're going to have separate user-configured masks and effective ones. Eventually configured masks can only be changed by writing cpuset.cpus and cpuset.mems, and they won't be restricted by parent cpuset. While effective masks reflect cpu/memory hotplug and hierachical restriction, and these are the real masks that apply to the tasks in the cpuset. We calculate effective mask this way: - top cpuset's effective_mask == online_mask, otherwise - cpuset's effective_mask == configured_mask & parent effective_mask, if the result is empty, it inherits parent effective mask. Those behavior changes are for default hierarchy only. For legacy hierarchy, effective_mask and configured_mask are the same, so we won't break old interfaces. To make cs->effective_{cpus,mems} to be effective masks, we need to - update the effective masks at hotplug - update the effective masks at config change - take on ancestor's mask when the effective mask is empty The last item is done here. This won't introduce behavior change. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: update cs->effective_{cpus, mems} when config changesLi Zefan
We're going to have separate user-configured masks and effective ones. Eventually configured masks can only be changed by writing cpuset.cpus and cpuset.mems, and they won't be restricted by parent cpuset. While effective masks reflect cpu/memory hotplug and hierachical restriction, and these are the real masks that apply to the tasks in the cpuset. We calculate effective mask this way: - top cpuset's effective_mask == online_mask, otherwise - cpuset's effective_mask == configured_mask & parent effective_mask, if the result is empty, it inherits parent effective mask. Those behavior changes are for default hierarchy only. For legacy hierarchy, effective_mask and configured_mask are the same, so we won't break old interfaces. To make cs->effective_{cpus,mems} to be effective masks, we need to - update the effective masks at hotplug - update the effective masks at config change - take on ancestor's mask when the effective mask is empty The second item is done here. We don't need to treat root_cs specially in update_cpumasks_hier(). This won't introduce behavior change. v3: - add a WARN_ON() to check if effective masks are the same with configured masks on legacy hierarchy. - pass trialcs->cpus_allowed to update_cpumasks_hier() and add a comment for it. Similar change for update_nodemasks_hier(). Suggested by Tejun. v2: - revise the comment in update_{cpu,node}masks_hier(), suggested by Tejun. - fix to use @cp instead of @cs in these two functions. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: update cpuset->effective_{cpus,mems} at hotplugLi Zefan
We're going to have separate user-configured masks and effective ones. Eventually configured masks can only be changed by writing cpuset.cpus and cpuset.mems, and they won't be restricted by parent cpuset. While effective masks reflect cpu/memory hotplug and hierachical restriction, and these are the real masks that apply to the tasks in the cpuset. We calculate effective mask this way: - top cpuset's effective_mask == online_mask, otherwise - cpuset's effective_mask == configured_mask & parent effective_mask, if the result is empty, it inherits parent effective mask. Those behavior changes are for default hierarchy only. For legacy hierarchy, effective_mask and configured_mask are the same, so we won't break old interfaces. To make cs->effective_{cpus,mems} to be effective masks, we need to - update the effective masks at hotplug - update the effective masks at config change - take on ancestor's mask when the effective mask is empty The first item is done here. This won't introduce behavior change. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cpuset: add cs->effective_cpus and cs->effective_memsLi Zefan
We're going to have separate user-configured masks and effective ones. Eventually configured masks can only be changed by writing cpuset.cpus and cpuset.mems, and they won't be restricted by parent cpuset. While effective masks reflect cpu/memory hotplug and hierachical restriction, and these are the real masks that apply to the tasks in the cpuset. We calculate effective mask this way: - top cpuset's effective_mask == online_mask, otherwise - cpuset's effective_mask == configured_mask & parent effective_mask, if the result is empty, it inherits parent effective mask. Those behavior changes are for default hierarchy only. For legacy hierachy, effective_mask and configured_mask are the same, so we won't break old interfaces. This patch adds the effective masks to struct cpuset and initializes them. The effective masks of the top cpuset is the same with configured masks, and a child cpuset inherits its parent's effective masks. This won't introduce behavior change. v2: - s/real_{mems,cpus}_allowed/effective_{mems,cpus}, suggested by Tejun. - don't init effective masks in cpuset_css_online() if !cgroup_on_dfl. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-09cgroup: remove sane_behavior support on non-default hierarchiesTejun Heo
sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>
2014-07-01cpuset: break kernfs active protection in cpuset_write_resmask()Tejun Heo
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up migrating tasks off a cpuset after new resources are added to it. As cpuset_hotplug_work calls into cgroup core via cgroup_transfer_tasks(), this flushing adds the dependency to cgroup core locking from cpuset_write_resmak(). This used to be okay because cgroup interface files were protected by a different mutex; however, 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the cgroup core locking and this dependency became a deadlock hazard - cgroup file removal performed under cgroup core lock tries to drain on-going file operation which is trying to flush cpuset_hotplug_work blocked on the same cgroup core lock. The locking simplification was done because kernfs added an a lot easier way to deal with circular dependencies involving kernfs active protection. Let's use the same strategy in cpuset and break active protection in cpuset_write_resmask(). While it isn't the prettiest, this is a very rare, likely unique, situation which also goes away on the unified hierarchy. The commands to trigger the deadlock warning without the patch and the lockdep output follow. localhost:/ # mount -t cgroup -o cpuset xxx /cpuset localhost:/ # mkdir /cpuset/tmp localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus localhost:/ # echo 0 > cpuset/tmp/cpuset.mems localhost:/ # echo $$ > /cpuset/tmp/tasks localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online ====================================================== [ INFO: possible circular locking dependency detected ] 3.16.0-rc1-0.1-default+ #7 Not tainted ------------------------------------------------------- kworker/1:0/32649 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 but task is already holding lock: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (cpuset_hotplug_work){+.+...}: ... -> #1 (s_active#175){++++.+}: ... -> #0 (cgroup_mutex){+.+.+.}: ... other info that might help us debug this: Chain exists of: cgroup_mutex --> s_active#175 --> cpuset_hotplug_work Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuset_hotplug_work); lock(s_active#175); lock(cpuset_hotplug_work); lock(cgroup_mutex); *** DEADLOCK *** 2 locks held by kworker/1:0/32649: #0: ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 #1: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 stack backtrace: CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7 ... Call Trace: [<ffffffff815a5f78>] dump_stack+0x72/0x8a [<ffffffff810c263f>] print_circular_bug+0x10f/0x120 [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0 [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0 [<ffffffff810c53d2>] __lock_acquire+0x382/0x660 [<ffffffff810c57a9>] lock_acquire+0xf9/0x170 [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380 [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0 [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180 [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630 [<ffffffff810854d4>] process_one_work+0x254/0x520 [<ffffffff810875dd>] worker_thread+0x13d/0x3d0 [<ffffffff8108e0c8>] kthread+0xf8/0x100 [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Li Zefan <lizefan@huawei.com> Tested-by: Li Zefan <lizefan@huawei.com>
2014-06-25cpuset,mempolicy: fix sleeping function called from invalid contextGu Zheng
When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@vger.kernel.org>
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...