path: root/kernel/sys.c
Age | Commit message | Author
2012-03-23 | prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision | Lennart Poettering

Userspace service managers/supervisors need to track their started services. Many services daemonize by double-forking and get implicitly re-parented to PID 1. The service manager will no longer be able to receive the SIGCHLD signals for them, and is no longer in charge of reaping the children with wait(). All information about the children is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of 'sub-init', able to stay as the parent for all orphaned processes created by the started services. All SIGCHLD signals will be delivered to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager much preferred over any possible asynchronous notification about specific PIDs, because the service manager has full access to the child process data in /proc and the PID can not be re-used until the wait(), the service-manager itself is in charge of, has happened.

As a side effect, the relevant parent PID information does not get lost by a double-fork, which results in a more elaborate process tree and 'ps' output:

before:
  # ps afx
  253 ? Ss 0:00 /bin/dbus-daemon --system --nofork
  294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd
  328 ? S 0:00 /usr/sbin/modem-manager
  608 ? Sl 0:00 /usr/libexec/colord
  658 ? Sl 0:00 /usr/libexec/upowerd
  819 ? Sl 0:00 /usr/libexec/imsettings-daemon
  916 ? Sl 0:00 /usr/libexec/udisks-daemon
  917 ? S 0:00  \_ udisks-daemon: not polling any devices

after:
  # ps afx
  294 ? Ss 0:00 /bin/dbus-daemon --system --nofork
  426 ? Sl 0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ? S 0:00  \_ /usr/sbin/modem-manager
  635 ? Sl 0:00  \_ /usr/libexec/colord
  705 ? Sl 0:00  \_ /usr/libexec/upowerd
  959 ? Sl 0:00  \_ /usr/libexec/udisks-daemon
  960 ? S 0:00  |   \_ udisks-daemon: not polling any devices
  977 ? Sl 0:00  \_ /usr/libexec/packagekitd

This prctl is orthogonal to PID namespaces. PID namespaces are isolated from each other, while a service management process usually requires the services to live in the same namespace, to be able to talk to each other.

Users of this will be the systemd per-user instance, which provides init-like functionality for the user's login session and D-Bus, which activates bus services on-demand. Both need init-like capabilities to be able to properly keep track of the services they start.

Many thanks to Oleg for several rounds of review and insights.

[akpm@linux-foundation.org: fix comment layout and spelling]
[akpm@linux-foundation.org: add lengthy code comment from Oleg]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
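The re-parenting behaviour described in this commit can be tried from userspace. The following is a minimal sketch, not code from the patch: the helper names `become_subreaper()` and `run_double_forking_service()` are hypothetical, and the fallback constant values (36 and 37) are assumed to match `linux/prctl.h` for headers that predate the option.

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (values assumed from linux/prctl.h). */
#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36
#endif
#ifndef PR_GET_CHILD_SUBREAPER
#define PR_GET_CHILD_SUBREAPER 37
#endif

/* Hypothetical helper: mark the calling process as a child subreaper.
 * Returns 0 on success, -1 with errno set on failure. */
int become_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
}

/* Hypothetical helper: start a "service" that daemonizes by double-forking,
 * then reap every child that ends up attached to this process.
 * Returns the number of processes reaped, or -1 if the first fork fails. */
int run_double_forking_service(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Intermediate process: fork the daemon, then exit immediately. */
        pid_t daemon_pid = fork();
        if (daemon_pid == 0) {
            /* The daemon's real work would happen here. */
            _exit(0);
        }
        _exit(daemon_pid < 0 ? 1 : 0);
    }

    /* Reap until wait() reports that no children are left (ECHILD). */
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = wait(&status);
        if (pid < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        reaped++;
    }
    return reaped;
}
```

If the process is a subreaper, the orphaned daemon is re-parented to it rather than to PID 1, so the loop should collect both the intermediate process and the daemon. Without the prctl, only the intermediate process would be reaped here.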
2012-03-16 | prctl: use CAP_SYS_RESOURCE for PR_SET_MM option | Cyrill Gorcunov

CAP_SYS_ADMIN is already overloaded left and right, so use CAP_SYS_RESOURCE here for more fine-grained access control. CAP_SYS_RESOURCE is chosen because this prctl option allows the current process to adjust some fields of the memory map descriptor, which rather represent what the process owns: pointers to the code, data and stack segments, the command line, auxiliary vector data, etc.

Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13 | c/r: prctl: add PR_SET_MM codes to set up mm_struct entries | Cyrill Gorcunov

When we restore a task we need to set up text, data and data heap sizes from userspace to the values a task had at checkpoint time. This patch adds auxiliary prctl codes for that.

While most of them have a statistical nature (their values are involved in the calculation of /proc/<pid>/statm output), the start_brk and brk values are used to compute the allowed size of program data segment expansion, which means an arbitrary change of these values might be a dangerous operation. So to restrict access, the following requirements are applied to the prctl calls:

 - The process has to have the CAP_SYS_ADMIN capability granted.
 - For all opcodes except the start_brk/brk members, an appropriate VMA area must exist and should fit certain VMA flags, such as:
   - code segment must be executable but not writable;
   - data segment must not be executable.

start_brk/brk values must not intersect with the data segment and must not exceed the RLIMIT_DATA resource limit.

Still the main guard is the CAP_SYS_ADMIN capability check.

Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support, otherwise these prctl calls will return -EINVAL.

[akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-15 | [S390] cputime: add sparse checking and cleanup | Martin Schwidefsky
Make cputime_t and cputime64_t nocast to enable sparse checking to detect incorrect use of cputime. Drop the cputime macros for simple scalar operations. The conversion macros are still needed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-11-07 | Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux | Linus Torvalds

* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-02 | sysctl: add support for poll() | Lucas De Marchi

Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds an infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname.

[akpm@linux-foundation.org: s/declare/define/ for definitions]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
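From userspace, waiting for a hostname change might look roughly like the sketch below. The helper names are hypothetical, and the choice of `POLLPRI` (with `POLLERR` reported alongside it) as the change indication is an assumption about how procfs signals the event; the commit message itself does not state which poll bits are used.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: open a sysctl file for change notification.
 * Returns a file descriptor, or -1 on error. */
int open_sysctl(const char *path)
{
    return open(path, O_RDONLY);
}

/* Hypothetical helper: wait up to timeout_ms for the sysctl behind fd
 * to change. Returns 1 if a change was reported, 0 on timeout, -1 on error.
 * Assumption: a change is signalled as POLLPRI and/or POLLERR. */
int wait_for_sysctl_change(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & (POLLERR | POLLPRI)) ? 1 : 0;
}
```

A caller would typically open `/proc/sys/kernel/hostname`, block in `wait_for_sysctl_change()` with a negative timeout, and re-read the file from offset 0 once a change is reported.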
2011-10-31 | kernel: fix several implicit usages of kmod.h | Paul Gortmaker

These files were implicitly relying on <linux/kmod.h> coming in via module.h, as without it we get things like:

  kernel/power/suspend.c:100: error: implicit declaration of function ‘usermodehelper_disable’
  kernel/power/suspend.c:109: error: implicit declaration of function ‘usermodehelper_enable’
  kernel/power/user.c:254: error: implicit declaration of function ‘usermodehelper_disable’
  kernel/power/user.c:261: error: implicit declaration of function ‘usermodehelper_enable’
  kernel/sys.c:317: error: implicit declaration of function ‘usermodehelper_disable’
  kernel/sys.c:1816: error: implicit declaration of function ‘call_usermodehelper_setup’
  kernel/sys.c:1822: error: implicit declaration of function ‘call_usermodehelper_setfns’
  kernel/sys.c:1824: error: implicit declaration of function ‘call_usermodehelper_exec’

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 | kernel: Map most files to use export.h instead of module.h | Paul Gortmaker

The changed files were only including linux/module.h for the EXPORT_SYMBOL infrastructure, and nothing else. Revector them onto the isolated export header for faster compile times.

Nothing to see here but a whole lot of instances of:

  -#include <linux/module.h>
  +#include <linux/export.h>

This commit is only changing the kernel dir; next targets will probably be mm, fs, the arch dirs, etc.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-24 | Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net | David S. Miller
2011-10-17 | Avoid using variable-length arrays in kernel/sys.c | Linus Torvalds
The size is always valid, but variable-length arrays generate worse code for no good reason (unless the function happens to be inlined and the compiler sees the length for the simple constant it is). Also, there seems to be some code generation problem on POWER, where Henrik Bakken reports that register r28 can get corrupted under some subtle circumstances (interrupt happening at the wrong time?). That all indicates some seriously broken compiler issues, but since variable length arrays are bad regardless, there's little point in trying to chase it down. "Just don't do that, then". Reported-by: Henrik Grindal Bakken <henribak@cisco.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
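The general pattern of swapping a variable-length array for a fixed-size buffer can be sketched as follows. This is an illustration only and not the code changed by the commit: `copy_release()` and the bound `RELEASE_MAX` are hypothetical names, and the value 65 is an assumed upper bound chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/* Assumed compile-time upper bound for the example. */
#define RELEASE_MAX 65

/* Hypothetical helper: copy at most len - 1 bytes of src into dst via a
 * scratch buffer and NUL-terminate. Returns the number of bytes copied,
 * not counting the terminator.
 *
 * A VLA version would declare "char buf[len];". Here the scratch buffer
 * has a constant size, so the compiler knows the stack frame layout. */
size_t copy_release(char *dst, size_t len, const char *src)
{
    char buf[RELEASE_MAX];
    size_t n;

    if (len > sizeof(buf))
        len = sizeof(buf);      /* clamp to the fixed capacity */
    if (len == 0)
        return 0;

    n = strlen(src);
    if (n >= len)
        n = len - 1;            /* leave room for the terminator */

    memcpy(buf, src, n);
    buf[n] = '\0';
    memcpy(dst, buf, n + 1);
    return n;
}
```

The clamp against `sizeof(buf)` replaces the implicit sizing a VLA would provide, which is the price of using a constant-size array.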
2011-09-28 | connector: add comm change event report to proc connector | Vladimir Zapolskiy

Add an event to monitor changes to a task's comm value. Such an event becomes vital if someone wants to control the threads of a process in different ways. A natural characteristic of a thread is its comm value, and, helpfully, application developers can change it at runtime. Reporting such events via the proc connector allows fine-grained monitoring and control. For instance, a process control daemon listening to the proc connector and following comm value policies can place specific threads in assigned cgroup partitions.

It might be possible to achieve a pale, partial, one-shot likeness without this update: if an application changes the comm value of a thread generator task beforehand and a new thread is then cloned, the proc connector listener gets the fork event and reads the new thread's comm value from the procfs stat file. This change, however, visibly simplifies and extends the matter.

Signed-off-by: Vladimir Zapolskiy <vzapolskiy@gmail.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-25 | Add a personality to report 2.6.x version numbers | Andi Kleen

I ran into a couple of programs which broke with the new Linux 3.0 version. Some of those were binary only. I tried to use LD_PRELOAD to work around it, but it was quite difficult and in one case impossible because of a mix of 32bit and 64bit executables.

For example, all kinds of management software from HP don't work unless we pretend to run a 2.6 kernel.

  $ uname -a
  Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
  $ hpacucli ctrl all show
  Error: No controllers detected.
  $ rpm -qf /usr/sbin/hpacucli
  hpacucli-8.75-12.0

Another notable case is that Python now reports "linux3" from sys.platform(); which in turn can break things that were checking sys.platform() == "linux2":

  https://bugzilla.mozilla.org/show_bug.cgi?id=664564

It seems pretty clear to me though it's a bug in the apps that are using '==' instead of .startswith(), but this allows us to unbreak broken programs.

This patch adds a UNAME26 personality that makes the kernel report a 2.6.40+x version number instead. The x is the x in 3.x.

I know this is somewhat ugly, but I didn't find a better workaround, and compatibility with existing programs is important.

Some programs also read /proc/sys/kernel/osrelease. This can be worked around in user space with mount --bind (and a mount namespace).

To use:

  wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
  gcc -o uname26 uname26.c
  ./uname26 program

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
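A rough sketch of what a wrapper like the one referenced above might do is shown here. It is not the contents of uname26.c: the helper name `get_release_uname26()` is hypothetical, and the flag value `0x0020000` is assumed to be the value of `UNAME26` in `linux/personality.h`, defined locally so the example does not depend on the installed header version.

```c
#include <stdio.h>
#include <sys/personality.h>
#include <sys/utsname.h>

/* Assumed value of the UNAME26 personality flag (see linux/personality.h). */
#define PERSONALITY_UNAME26 0x0020000

/* Hypothetical helper: add the UNAME26 flag to the current personality and
 * copy the release string that uname() then reports into out.
 * Returns 0 on success, -1 on failure. */
int get_release_uname26(char *out, size_t outlen)
{
    struct utsname u;
    int cur = personality(0xffffffff);   /* query without changing */

    if (cur == -1)
        return -1;
    if (personality((unsigned long)cur | PERSONALITY_UNAME26) == -1)
        return -1;
    if (uname(&u) != 0)
        return -1;

    snprintf(out, outlen, "%s", u.release);
    return 0;
}
```

The flag stays set for the calling process after the helper returns. As far as I understand, a wrapper relies on the personality being kept across exec, so it would set the flag and then exec the target program.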
2011-08-11 | move RLIMIT_NPROC check from set_user() to do_execve_common() | Vasiliy Kulikov

The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC check in set_user() to check for NPROC exceeding via setuid() and similar functions.

Before the check there was a possibility to greatly exceed the allowed number of processes by an unprivileged user if the program relied on rlimit only. But the check created a new security threat: many poorly written programs simply don't check the setuid() return code and believe it cannot fail if executed with root privileges. So, the check is removed in this patch because privilege escalations related to buggy programs are too frequent.

The NPROC can still be enforced in the common code flow of daemons spawning user processes. Most daemons do fork()+setuid()+execve(). The check introduced in execve() (1) enforces the same limit as in setuid() and (2) doesn't create similar security issues.

Neil Brown suggested tracking which specific process has exceeded the limit by setting the PF_NPROC_EXCEEDED process flag. With the change only this process would fail on execve(), and other processes' execve() behaviour is not changed.

Solar Designer suggested re-checking whether the NPROC limit is still exceeded at the moment of execve(). If the process was sleeping for days between set*uid() and execve(), and the NPROC counter stepped down under the limit, the deferred execve() failure because the NPROC limit was exceeded days ago would be unexpected. If the limit is not exceeded anymore, we clear the flag on successful calls to execve() and fork().

The flag is also cleared on successful calls to set_user() as the limit was exceeded for the previous user, not the current one.

A similar check was introduced in -ow patches (without the process flag).

v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
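The bug class mentioned above, programs that ignore the return value of setuid(), is avoided by checking every call. The sketch below is a hypothetical helper, not code from the patch, and it deliberately leaves out supplementary groups, so it should not be read as a complete privilege drop.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch to the given gid and uid, treating any
 * failure as fatal for the caller. Returns 0 on success, -1 on failure.
 * Simplification: supplementary groups are not touched here. */
int drop_privileges(uid_t uid, gid_t gid)
{
    /* Change the group first, while we may still have the right to. */
    if (setgid(gid) != 0) {
        perror("setgid");
        return -1;
    }
    if (setuid(uid) != 0) {
        perror("setuid");
        return -1;
    }
    /* Verify the result instead of assuming the calls took effect. */
    if (getuid() != uid || geteuid() != uid)
        return -1;
    return 0;
}
```

A daemon following the fork()+setuid()+execve() flow would call this in the child and exit without calling execve() if it returns -1.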
2011-07-26 | notifiers: sys: move reboot notifiers into reboot.h | Amerigo Wang
It is not necessary to share the same notifier.h. This patch already moves register_reboot_notifier() and unregister_reboot_notifier() from kernel/notifier.c to kernel/sys.c. [amwang@redhat.com: make allyesconfig succeed on ppc64] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 | Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 | Linus Torvalds

* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
  debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
  sysfs: remove "last sysfs file:" line from the oops messages
  drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
  memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
  SYSFS: Fix erroneous comments for sysfs_update_group().
  driver core: remove the driver-model structures from the documentation
  driver core: Add the device driver-model structures to kerneldoc
  Translated Documentation/email-clients.txt
  RAW driver: Remove call to kobject_put().
  reboot: disable usermodehelper to prevent fs access
  efivars: prevent oops on unload when efi is not enabled
  Allow setting of number of raw devices as a module parameter
  Introduce CONFIG_GOOGLE_FIRMWARE
  driver: Google Memory Console
  driver: Google EFI SMI
  x86: Better comments for get_bios_ebda()
  x86: get_bios_ebda_length()
  misc: fix ti-st build issues
  params.c: Use new strtobool function to process boolean inputs
  debugfs: move to new strtobool
  ...

Fix up trivial conflicts in fs/debugfs/file.c due to the same patch being applied twice, and an unrelated cleanup nearby.
2011-05-11 | PM: Remove sysdev suspend, resume and shutdown operations | Rafael J. Wysocki
Since suspend, resume and shutdown operations in struct sysdev_class and struct sysdev_driver are not used any more, remove them. Also drop sysdev_suspend(), sysdev_resume() and sysdev_shutdown() used for executing those operations and modify all of their users accordingly. This reduces kernel code size quite a bit and reduces its complexity. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-05-07 | reboot: disable usermodehelper to prevent fs access | Kay Sievers
In case CONFIG_UEVENT_HELPER_PATH is not set to "", which it should be on every system, the kernel forks processes during shutdown, which try to access the rootfs, even when the binary does not exist. It causes exceptions and long delays in the disk driver, which gets read requests at the time it tries to shut down the disk. This patch disables all kernel-forked processes during reboot to allow a clean poweroff. Cc: Tejun Heo <htejun@gmail.com> Tested-By: Anton Guda <atu@dmeti.dp.ua> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-24 | userns: user namespaces: convert all capable checks in kernel/sys.c | Serge E. Hallyn
This allows setuid/setgid in containers. It also fixes some corner cases where kernel logic foregoes capability checks when uids are equivalent. The latter will need to be done throughout the whole kernel. Changelog: Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Fix logic errors in uid checks pointed out by Bastian. Feb 15: allow prlimit to current (was regression in previous version) Feb 23: remove debugging printks, uninline set_one_prio_perm and make it bool, and document its return value. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24 | userns: allow sethostname in a container | Serge E. Hallyn

Changelog:
  Feb 23: let clone_uts_ns() handle setting uts->user_ns. To do so we need to pass in the task_struct who'll get the utsname, so we can get its user_ns.
  Feb 23: As per Oleg's comment, just pass in tsk, instead of two of its members.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-14 | PM / Core: Introduce struct syscore_ops for core subsystems PM | Rafael J. Wysocki

Some subsystems need to carry out suspend/resume and shutdown operations with one CPU on-line and interrupts disabled. The only way to register such operations is to define a sysdev class and a sysdev specifically for this purpose, which is cumbersome and inefficient. Moreover, the arguments taken by sysdev suspend, resume and shutdown callbacks are practically never necessary. For this reason, introduce a simpler interface allowing subsystems to register operations to be executed very late during system suspend and shutdown and very early during resume in the form of struct syscore_ops objects.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-01-31 | Fix prlimit64 for suid/sgid processes | Kacper Kornet

Since check_prlimit_permission always fails in the case of SUID/SGID processes, such processes are not able to read or set their own limits. This commit changes this by assuming that a process can always read/change its own limits.

Signed-off-by: Kacper Kornet <kornet@camk.edu.pl>
Acked-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 | kmsg_dump: add kmsg_dump() calls to the reboot, halt, poweroff and emergency_restart paths | Seiji Aguchi

In our support service, we need to know why the system rebooted. However, we can't inform our customers of the reason because the final messages are lost on the current Linux kernel.

This patch improves the situation by adding kmsg_dump() to the reboot, halt, poweroff and emergency_restart paths, so that the final messages are saved.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 | sched: Add 'autogroup' scheduling feature: automated per session task groups | Mike Galbraith

A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the previous task group is dropped. Child processes inherit this task group thereafter, and increase its refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used.

Autogroup bandwidth is controllable via setting its nice level through the proc filesystem:

  cat /proc/<pid>/autogroup

Displays the task's group and the group's nice level.

  echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via:

  echo [01] > /proc/sys/kernel/sched_autogroup_enabled

... which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-01 | pid: make setpgid() system call use RCU read-side critical section | Paul E. McKenney

[ 23.584719]
[ 23.584720] ===================================================
[ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 23.585176] ---------------------------------------------------
[ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
[ 23.585176]
[ 23.585176] other info that might help us debug this:
[ 23.585176]
[ 23.585176]
[ 23.585176] rcu_scheduler_active = 1, debug_locks = 1
[ 23.585176] 1 lock held by rc.sysinit/728:
[ 23.585176] #0: (tasklist_lock){.+.+..}, at: [<ffffffff8104771f>] sys_setpgid+0x5f/0x193
[ 23.585176]
[ 23.585176] stack backtrace:
[ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
[ 23.585176] Call Trace:
[ 23.585176] [<ffffffff8105b436>] lockdep_rcu_dereference+0x99/0xa2
[ 23.585176] [<ffffffff8104c324>] find_task_by_pid_ns+0x50/0x6a
[ 23.585176] [<ffffffff8104c35b>] find_task_by_vpid+0x1d/0x1f
[ 23.585176] [<ffffffff81047727>] sys_setpgid+0x67/0x193
[ 23.585176] [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
[ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas

It turns out that the setpgid() system call fails to enter an RCU read-side critical section before doing a PID-to-task_struct translation. This commit therefore does rcu_read_lock() before the translation, and also does rcu_read_unlock() after the last use of the returned pointer.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
2010-07-16 | rlimits: implement prlimit64 syscall | Jiri Slaby

This patch adds the code to support the sys_prlimit64 syscall which modifies-and-returns the rlim values of a selected process atomically. The first parameter, pid, being 0 means current process.

Unlike the current implementation, it is a generic interface, architecture independent, so that we needn't handle compat stuff anymore. In the future, after glibc starts to use this, we can deprecate sys_setrlimit and sys_getrlimit to finally clean up the code.

It also adds a possibility of changing limits of other processes. We check the user's permissions to do that and if it succeeds, the new limits are propagated online. This is good for large scale applications such as SAP or databases where administrators need to change limits from time to time (e.g. on crashes increase core size). And it is unacceptable to restart the service.

For safety, all rlim users now either use accessors or don't need them due to
 - locking
 - the fact a process was just forked and nobody else knows about it yet (and nobody can thus read/write limits)
hence it is safe to modify limits now.

The limitation is that we currently stay at ulong internal representation. So the rlim64_is_infinity check is used where value is compared against ULONG_MAX on 32-bit which is the maximum value there.

And since internally the limits are held in struct rlimit, converters which are used before and after do_prlimit call in sys_prlimit64 are introduced.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
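From userspace, the modify-and-return behaviour is reachable through the glibc `prlimit()` wrapper. The sketch below is a hedged illustration, with `swap_core_limit()` as a hypothetical helper name: it sets a new soft core-size limit and returns the previous limits in one call, with a pid of 0 meaning the calling process.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical helper: set the soft RLIMIT_CORE of process pid to new_soft
 * (clamped to the hard limit) and store the previous limits in *old.
 * A pid of 0 selects the calling process.
 * Returns 0 on success, -1 with errno set on failure. */
int swap_core_limit(pid_t pid, rlim_t new_soft, struct rlimit *old)
{
    struct rlimit cur, new_limit;

    /* Read only: a NULL new-limit pointer leaves the limits untouched. */
    if (prlimit(pid, RLIMIT_CORE, NULL, &cur) != 0)
        return -1;

    new_limit.rlim_max = cur.rlim_max;  /* keep the hard limit as is */
    new_limit.rlim_cur = new_soft > cur.rlim_max ? cur.rlim_max : new_soft;

    /* Set the new limits and fetch the old ones in a single call. */
    return prlimit(pid, RLIMIT_CORE, &new_limit, old);
}
```

Passing another process's pid would exercise the permission check the commit message describes. The initial read is only there to preserve the hard limit and is not part of the atomic swap itself.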
2010-07-16 | rlimits: switch more rlimit syscalls to do_prlimit | Jiri Slaby
After we added more generic do_prlimit, switch sys_getrlimit to that. Also switch compat handling, so we can get rid of ugly __user casts and avoid setting process' address limit to kernel data and back. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16 | rlimits: redo do_setrlimit to more generic do_prlimit | Jiri Slaby
It now allows also reading of limits. I.e. all read and writes will later use this function. It takes two parameters, new and old limits which can be both NULL. If new is non-NULL, the value in it is set to rlimits. If old is non-NULL, current rlimits are stored there. If both are non-NULL, old are stored prior to setting the new ones, atomically. (Similar to sigaction.) Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16 | rlimits: do security check under task_lock | Jiri Slaby
Do security_task_setrlimit under task_lock. Other tasks may change limits under our hands while we are checking limits inside the function. From now on, they can't. Note that all the security work is done under a spinlock here now. Security hooks count with that, they are called from interrupt context (like security_task_kill) and with spinlocks already held (e.g. capable->security_capable). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2010-07-16 | rlimits: allow setrlimit to non-current tasks | Jiri Slaby
Add locking to allow setrlimit accept task parameter other than current. Namely, lock tasklist_lock for read and check whether the task structure has sighand non-null. Do all the signal processing under that lock still held. There are some points: 1) security_task_setrlimit is now called with that lock held. This is not new, many security_* functions are called with this lock held already so it doesn't harm (all this security_* stuff does almost the same). 2) task->sighand->siglock (in update_rlimit_cpu) is nested in tasklist_lock. This dependence is already existing. 3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already existing dependence. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>
2010-07-16 | rlimits: split sys_setrlimit | Jiri Slaby
Create do_setrlimit from sys_setrlimit and declare do_setrlimit in the resource header. This is the first phase to have generic do_prlimit which allows to be called from read, write and compat rlimits code. The new do_setrlimit also accepts a task pointer to change the limits of. Currently, it cannot be other than current, but this will change with locking later. Also pass tsk->group_leader to security_task_setrlimit to check whether current is allowed to change rlimits of the process and not its arbitrary thread because it makes more sense given that rlimit are per process and not per-thread. Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16 | rlimits: make sure ->rlim_max never grows in sys_setrlimit | Oleg Nesterov
Mostly preparation for Jiri's changes, but probably makes sense anyway. sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when it takes task_lock() old_rlim->rlim_max can be already lowered. Move this check under task_lock(). Currently this is not important, we can only race with our sub-thread, this means the application is stupid. But when we change the code to allow the update of !current task's limits, it becomes important to make sure ->rlim_max can be lowered "reliably" even if we race with the application doing sys_setrlimit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2010-07-16rlimits: add task_struct to update_rlimit_cpuJiri Slaby
Add task_struct as a parameter to update_rlimit_cpu to be able to set rlimit_cpu of different task than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: James Morris <jmorris@namei.org>
2010-07-16rlimits: security, add task_struct to setrlimitJiri Slaby
Add task_struct to task_setrlimit of security_operations to be able to set rlimit of task other than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org>
2010-05-27kmod: add init function to usermodehelperNeil Horman
About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe feature in the kernel works. We had reports of several races, including some reports of apps bypassing our recursion check so that a process that was forked as part of a core_pattern setup could infinitely crash and refork until the system crashed. We fixed those by improving our recursion checks. The new check basically refuses to fork a process if its core limit is zero, which works well. Unfortunately, I've been getting grief from maintainers of user space programs that are inserted as the forked process of core_pattern. They contend that in order for their programs (such as abrt and apport) to work, all the running processes in a system must have their core limits set to a non-zero value, to which I say 'yes'. I did this by design, and think that's the right way to do things. But I've been asked to ease this burden on user space enough times that I thought I would take a look at it. The first suggestion was to make the recursion check fail on a non-zero 'special' number, like one. That way the core collector process could set its core size ulimit to 1, and enable the kernel's recursion detection. This isn't a bad idea on the surface, but I don't like it since it's opt-in, in that if a program like abrt or apport has a bug and fails to set such a core limit, we're left with a recursively crashing system again. So I've come up with this. What I've done is modify the call_usermodehelper api such that an extra parameter is added, a function pointer which will be called by the user helper task, after it forks, but before it exec's the required process. This will give the caller the opportunity to get a call back in the process's context, allowing it to do whatever it needs to do to the process in the kernel prior to exec-ing the user space code. In the case of do_coredump, this callback is used to set the core ulimit of the helper process to 1. 
This eliminates the opt-in problem that I had above, as it allows the ulimit for core sizes to be set to the value of 1, which is what the recursion check looks for in do_coredump. This patch: Create new function call_usermodehelper_fns() and allow it to assign both an init and cleanup function, as well as arbitrary data. The init function is called from the context of the forked process and allows for customization of the helper process prior to calling exec. Its return code gates the continuation of the process, or causes its exit. Also add an arbitrary data pointer to the subprocess_info struct allowing for data to be passed from the caller to the new process, and the subsequent cleanup process. Also, use this patch to clean up the cleanup function. It currently takes an argp and envp pointer for freeing, which is ugly. Let's instead just make the subprocess_info structure public, and pass that to the cleanup and init routines. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
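The "init callback between fork and exec" idea has a direct user-space analogue. The sketch below is not the kernel's call_usermodehelper_fns(); it uses the POSIX calls fork(), execvp(), waitpid() and setrlimit(), and the names `helper_init_fn`, `spawn_helper()`, `limit_core_init()` and `refuse_init()` are hypothetical. As in the message, the callback's return code decides whether the helper goes on to exec.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback type: runs in the child after fork(), before exec. */
typedef int (*helper_init_fn)(void *data);

/* Example init: set the helper's core size limit to the value in *data. */
static int limit_core_init(void *data)
{
    struct rlimit rl;

    rl.rlim_cur = rl.rlim_max = *(const rlim_t *)data;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Example init that always refuses, so the helper never reaches exec. */
static int refuse_init(void *data)
{
    (void)data;
    return -1;
}

/*
 * Fork, run init in the child, and exec argv only if init returned 0.
 * Returns the child's exit status, or -1 if fork/wait failed or the
 * child did not exit normally.
 */
static int spawn_helper(char *const argv[], helper_init_fn init, void *data)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (init && init(data) != 0)
            _exit(1);       /* init failed: exit without exec */
        execvp(argv[0], argv);
        _exit(127);         /* exec itself failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing `limit_core_init` with a value of 1 mirrors what the message says do_coredump does for its helper; note that setrlimit() can fail with EPERM if the hard limit is already below the requested value, in which case this sketch exits the child without exec.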
2010-05-06Merge branch 'master' into nextJames Morris
2010-04-24kernel/sys.c: fix compat uname machineAndreas Schwab
On ppc64 you get this error: $ setarch ppc -R true setarch: ppc: Unrecognized architecture because uname still reports ppc64 as the machine. So mask off the personality flags when checking for PER_LINUX32. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
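Why the unmasked comparison fails can be shown with plain integer arithmetic. This is a sketch, assuming the constant values from the kernel's personality header (PER_LINUX32 = 0x0008, PER_MASK = 0x00ff, ADDR_NO_RANDOMIZE = 0x0040000); the `DEMO_` names and helper functions are hypothetical and exist only for illustration. `setarch -R` sets the ADDR_NO_RANDOMIZE flag, so the full personality word no longer equals PER_LINUX32.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the personality constants, assumed to match the kernel header. */
enum {
    DEMO_ADDR_NO_RANDOMIZE = 0x0040000, /* flag bit set by "setarch -R" */
    DEMO_PER_LINUX32       = 0x0008,    /* 32-bit personality type */
    DEMO_PER_MASK          = 0x00ff     /* low byte holds the personality type */
};

/* Buggy form: compares the whole word, so any flag bit breaks the match. */
static bool is_linux32_unmasked(unsigned int pers)
{
    return pers == DEMO_PER_LINUX32;
}

/* Fixed form: strip the flag bits before comparing the personality type. */
static bool is_linux32_masked(unsigned int pers)
{
    return (pers & DEMO_PER_MASK) == DEMO_PER_LINUX32;
}
```

With the personality set to `PER_LINUX32 | ADDR_NO_RANDOMIZE`, only the masked check recognises the 32-bit personality.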
2010-04-12security: remove dead hook task_setgidEric Paris
Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12security: remove dead hook task_setuidEric Paris
Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.hTejun Heo
percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have a fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. 
The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-12Add generic sys_olduname()Christoph Hellwig
Add generic implementations of the old and really old uname system calls. Note that sh only implements sys_olduname but not sys_oldolduname, but I'm not going to bother with another ifdef for that special case. m32r implemented an old uname but never wired it up, so kill it, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12improve sys_newuname() for compat architecturesChristoph Hellwig
On an architecture that supports 32-bit compat we need to override the reported machine in uname with the 32-bit value. Instead of doing this separately in every architecture introduce a COMPAT_UTS_MACHINE define in <asm/compat.h> and apply it directly in sys_newuname(). Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06kernel core: use helpers for rlimitsJiri Slaby
Make sure the compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use the rlimit helpers added in commit 3e10e716abf3 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
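The "fetch once" pattern can be sketched in portable C. This is an illustration only: `read_limit_once()`, `request_allowed()` and `DEMO_LIMIT_INFINITY` are hypothetical names, and the volatile cast merely imitates the idea behind ACCESS_ONCE; it is not the kernel's rlimit helper API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical "no limit" marker for this sketch. */
#define DEMO_LIMIT_INFINITY (~0UL)

/*
 * Read the limit exactly once. The volatile-qualified access tells the
 * compiler it may not re-read *limit later instead of using the value
 * returned here.
 */
static unsigned long read_limit_once(const unsigned long *limit)
{
    return *(const volatile unsigned long *)limit;
}

/*
 * Both comparisons use the same snapshot, so a concurrent writer cannot
 * make the "infinity" test and the range test see different values.
 */
static bool request_allowed(const unsigned long *limit, unsigned long request)
{
    unsigned long snapshot = read_limit_once(limit);

    if (snapshot == DEMO_LIMIT_INFINITY)
        return true;
    return request <= snapshot;
}
```

Without the snapshot, code that tests `*limit` twice could pass the first test with one value and fail the second with another once limits become writable by other tasks.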
2010-02-28Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tipLinus Torvalds
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  sched: Fix SCHED_MC regression caused by change in sched cpu_power
  sched: Don't use possibly stale sched_class
  kthread, sched: Remove reference to kthread_create_on_cpu
  sched: cpuacct: Use bigger percpu counter batch values for stats counters
  percpu_counter: Make __percpu_counter_add an inline function on UP
  sched: Remove member rt_se from struct rt_rq
  sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
  sched: Remove unused update_shares_locked()
  sched: Use for_each_bit
  sched: Queue a deboosted task to the head of the RT prio queue
  sched: Implement head queueing for sched_rt
  sched: Extend enqueue_task to allow head queueing
  sched: Remove USER_SCHED
  sched: Fix the place where group powers are updated
  sched: Assume *balance is valid
  sched: Remove load_balance_newidle()
  sched: Unify load_balance{,_newidle}()
  sched: Add a lock break for PREEMPT=y
  sched: Remove from fwd decls
  sched: Remove rq_iterator from move_one_task
  ...
Fix up trivial conflicts in kernel/sched.c
2010-02-23kernel/sys.c: fix missing rcu protection for sys_getpriority()Tetsuo Handa
find_task_by_vpid() is not safe without rcu_read_lock(). 2.6.33-rc7 got RCU protection for sys_setpriority() but missed it for sys_getpriority(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-21sched: Remove USER_SCHEDDhaval Giani
Remove the USER_SCHED feature. It has been scheduled to be removed in 2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2 Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263990378.24844.3.camel@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-19Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tipLinus Torvalds
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sys: Fix missing rcu protection for __task_cred() access
  signals: Fix more rcu assumptions
  signal: Fix racy access to __task_cred in kill_pid_info_as_uid()
2009-12-15kernel/sys.c: fix "warning: do-while statement is not a compound statement" noiseH Hartley Sweeten
do_each_thread/while_each_thread wrap a block of code that is in this format: for (...) do ... while If curly braces do not surround the inner loop the following warning is generated by sparse: warning: do-while statement is not a compound statement Fix the warning by adding the braces. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
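The shape of such a macro pair, and where the braces go, can be sketched with simplified stand-ins. `do_each_cell` and `while_each_cell` below are hypothetical macros that only imitate the "for (...) do ... while" expansion described in the message; they are not the kernel's do_each_thread/while_each_thread definitions.

```c
#include <assert.h>

/* Opens an outer for loop and an inner do-while; the caller supplies the body. */
#define do_each_cell(i, j, rows) \
    for ((i) = 0, (j) = 0; (i) < (rows); (i)++, (j) = 0) do

/* Closes the inner do-while opened by do_each_cell(). */
#define while_each_cell(j, cols) \
    while (++(j) < (cols))

/* Count the cells visited; with cols >= 1 this is rows * cols. */
static int count_cells(int rows, int cols)
{
    int i, j, n = 0;

    /*
     * The braces make the do-while body a compound statement. Writing
     * "do_each_cell(i, j, rows) n++; while_each_cell(j, cols);" is also
     * valid C, but it is the unbraced form the warning is about.
     */
    do_each_cell(i, j, rows) {
        n++;
    } while_each_cell(j, cols);

    return n;
}
```

Because the inner loop is a do-while, the body runs at least once per outer iteration, which is worth keeping in mind when reading code built on this kind of macro pair.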
2009-12-10sys: Fix missing rcu protection for __task_cred() accessThomas Gleixner
commit c69e8d9 (CRED: Use RCU to access another task's creds and to release a task's own creds) added non rcu_read_lock() protected access to task creds of the target task in set_prio_one(). The comment above the function says: * - the caller must hold the RCU read lock The calling code in sys_setpriority does read_lock(&tasklist_lock) but not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n. With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick interrupt when they see no read side critical section. There is another instance of __task_cred() in sys_setpriority() itself which is equally unprotected. Wrap the whole code section into a rcu read side critical section to fix this quick and dirty. Will be revisited in course of the read_lock(&tasklist_lock) -> rcu crusade. Oleg noted further: This also fixes another bug here. find_task_by_vpid() is not safe without rcu_read_lock(). I do not mean it is not safe to use the result, just find_pid_ns() by itself is not safe. Usually tasklist gives enough protection, but if copy_process() fails it calls free_pid() lockless and does call_rcu(delayed_put_pid(). This means, without rcu lock find_pid_ns() can't scan the hash table safely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.029784964@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2009-12-09Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tipLinus Torvalds
* 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sys: Remove BKL from sys_reboot
  pm_qos: clean up racy global "name" variable
  pm_qos: remove BKL
2009-12-02sched, cputime: Introduce thread_group_times()Hidetoshi Seto
This is a real fix for the problem of decreasing utime/stime values described in the thread: http://lkml.org/lkml/2009/11/3/522 Now cputime is accounted in the following way: - {u,s}time in task_struct are increased every time the thread is interrupted by a tick (timer interrupt). - When a thread exits, its {u,s}time are added to signal->{u,s}time, after being adjusted by task_times(). - When all threads in a thread_group exit, the accumulated {u,s}time (and also c{u,s}time) in the signal struct are added to c{u,s}time in the signal struct of the group's parent. So {u,s}time in the task struct are "raw" tick counts, while {u,s}time and c{u,s}time in the signal struct are "adjusted" values. The accounted values are used by: - task_times(), to get the cputime of a thread: This function returns adjusted values that originate from raw {u,s}time, scaled by the sum_exec_runtime accounted by CFS. - thread_group_cputime(), to get the cputime of a thread group: This function returns the sum of all {u,s}time of living threads in the group, plus {u,s}time in the signal struct, which is the sum of the adjusted cputimes of all exited threads that belonged to the group. The problem is the return value of thread_group_cputime(), because it is a mixed sum of "raw" and "adjusted" values: group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time) This misbehavior can break {u,s}time monotonicity. Assume there is a thread whose raw values are greater than its adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only runs 45ms); if it exits, cputime will decrease (e.g. by 5ms). To fix this, we could do: group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time) But task_times() contains hard divisions, so applying it to every thread should be avoided. This patch fixes the above problem in the following way: - Modify thread exit (= __exit_signal()) not to use task_times(). This means {u,s}time in the signal struct accumulates raw values instead of adjusted values. 
As a result, thread_group_cputime() returns a pure sum of "raw" values. - Introduce a new function thread_group_times(*task, *utime, *stime) that converts the "raw" values of thread_group_cputime() to "adjusted" values, using the same calculation procedure as task_times(). - Modify group exit (= wait_task_zombie()) to use the newly introduced thread_group_times(). This makes c{u,s}time in the signal struct hold adjusted values, as before this patch. - Replace some thread_group_cputime() calls with thread_group_times(). These replacements are applied only where the "adjusted" cputime is conveyed to users and where task_times() is already used nearby (i.e. sys_times(), getrusage(), and /proc/<PID>/stat). This patch has a positive side effect: - Before this patch, if a group contained many short-lived threads (e.g. each runs 0.9ms and is not interrupted by ticks), the group's cputime could be invisible, since each thread's cputime was accumulated after being adjusted: imagine the adjustment function as adj(ticks, runtime), {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0. After this patch this will not happen, because the adjustment is applied after accumulation. v2: - remove if()s, put new variables into signal_struct. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B162517.8040909@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
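The monotonicity problem in the message (50 ticks at 1000Hz, 45ms actually run) can be reproduced with a toy model. This is a sketch under simplifying assumptions: times are whole milliseconds, each thread carries a precomputed adjusted value, and `struct demo_thread`, `group_time_mixed()` and `group_time_raw()` are hypothetical names, not kernel functions.

```c
#include <assert.h>

/* Toy per-thread record: raw tick-based time and its adjusted counterpart. */
struct demo_thread {
    unsigned long raw_ms;       /* "raw" time counted from ticks */
    unsigned long adjusted_ms;  /* value after scaling by actual runtime */
    int exited;                 /* non-zero once the thread has exited */
};

/*
 * Old scheme: live threads contribute raw values, exited threads
 * contribute adjusted values, so the sum can shrink when a thread exits.
 */
static unsigned long group_time_mixed(const struct demo_thread *t, int n)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += t[i].exited ? t[i].adjusted_ms : t[i].raw_ms;
    return sum;
}

/*
 * New scheme: raw values are accumulated for live and exited threads
 * alike; any adjustment would be applied once, to the total.
 */
static unsigned long group_time_raw(const struct demo_thread *t, int n)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += t[i].raw_ms;
    return sum;
}
```

For a single thread with 50ms raw and 45ms adjusted time, the mixed sum reads 50 while the thread is alive and 45 after it exits, a decrease of 5ms, whereas the raw sum stays at 50 across the exit.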