summaryrefslogtreecommitdiff
path: root/kernel/sysctl.c
AgeCommit message (Collapse)Author
2011-03-24sysctl: restrict write access to dmesg_restrictRichard Weinberger
When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel ring buffer. But a root user without CAP_SYS_ADMIN is able to reset dmesg_restrict to 0. This is an issue when e.g. LXC (Linux Containers) are used and complete user space is running without CAP_SYS_ADMIN. A unprivileged and jailed root user can bypass the dmesg_restrict protection. With this patch writing to dmesg_restrict is only allowed when root has CAP_SYS_ADMIN. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Dan Rosenberg <drosenberg@vsecurity.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Cc: Eric Paris <eparis@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24sysctl: add some missing input constraint checksPetr Holasek
Add boundaries of allowed input ranges for: dirty_expire_centisecs, drop_caches, overcommit_memory, page-cluster and panic_on_oom. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: Dave Young <hidave.darkstar@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits) AppArmor: kill unused macros in lsm.c AppArmor: cleanup generated files correctly KEYS: Add an iovec version of KEYCTL_INSTANTIATE KEYS: Add a new keyctl op to reject a key with a specified error code KEYS: Add a key type op to permit the key description to be vetted KEYS: Add an RCU payload dereference macro AppArmor: Cleanup make file to remove cruft and make it easier to read SELinux: implement the new sb_remount LSM hook LSM: Pass -o remount options to the LSM SELinux: Compute SID for the newly created socket SELinux: Socket retains creator role and MLS attribute SELinux: Auto-generate security_is_socket_class TOMOYO: Fix memory leak upon file open. Revert "selinux: simplify ioctl checking" selinux: drop unused packet flow permissions selinux: Fix packet forwarding checks on postrouting selinux: Fix wrong checks for selinux_policycap_netpeer selinux: Fix check for xfrm selinux context algorithm ima: remove unnecessary call to ima_must_measure IMA: remove IMA imbalance checking ...
2011-03-16Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) sched: Resched proper CPU on yield_to() sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy sched: Allow SCHED_BATCH to preempt SCHED_IDLE tasks sched: Clean up the IRQ_TIME_ACCOUNTING code sched: Add #ifdef around irq time accounting functions sched, autogroup: Stop claiming ownership of the root task group sched, autogroup: Stop going ahead if autogroup is disabled sched, autogroup, sysctl: Use proc_dointvec_minmax() instead sched: Fix the group_imb logic sched: Clean up some f_b_g() comments sched: Clean up remnants of sd_idle sched: Wholesale removal of sd_idle logic sched: Add yield_to(task, preempt) functionality sched: Use a buddy to implement yield_task_fair() sched: Limit the scope of clear_buddies sched: Check the right ->nr_running in yield_task_fair() sched: Avoid expensive initial update_cfs_load(), on UP too sched: Fix switch_from_fair() sched: Simplify the idle scheduling class softirqs: Account ksoftirqd time as cpustat softirq ...
2011-03-16Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits) perf probe: Clean up probe_point_lazy_walker() return value tracing: Fix irqoff selftest expanding max buffer tracing: Align 4 byte ints together in struct tracer tracing: Export trace_set_clr_event() tracing: Explain about unstable clock on resume with ring buffer warning ftrace/graph: Trace function entry before updating index ftrace: Add .ref.text as one of the safe areas to trace tracing: Adjust conditional expression latency formatting. tracing: Fix event alignment: skb:kfree_skb tracing: Fix event alignment: mce:mce_record tracing: Fix event alignment: kvm:kvm_hv_hypercall tracing: Fix event alignment: module:module_request tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup tracing: Remove lock_depth from event entry perf header: Stop using 'self' perf session: Use evlist/evsel for managing perf.data attributes perf top: Don't let events to eat up whole header line perf top: Fix events overflow in top command ring-buffer: Remove unused #include <linux/trace_irq.h> tracing: Add an 'overwrite' trace_option. ...
2011-03-15Merge branch 'next' into for-linusJames Morris
2011-03-08unfuck proc_sysctl ->d_compare()Al Viro
a) struct inode is not going to be freed under ->d_compare(); however, the thing PROC_I(inode)->sysctl points to just might. Fortunately, it's enough to make freeing that sucker delayed, provided that we don't step on its ->unregistering, clear the pointer to it in PROC_I(inode) before dropping the reference and check if it's NULL in ->d_compare(). b) I'm not sure that we *can* walk into NULL inode here (we recheck dentry->seq between verifying that it's still hashed / fetching dentry->d_inode and passing it to ->d_compare() and there's no negative hashed dentries in /proc/sys/*), but if we can walk into that, we really should not have ->d_compare() return 0 on it! Said that, I really suspect that this check can be simply killed. Nick? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-08Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into nextJames Morris
2011-02-23sched, autogroup, sysctl: Use proc_dointvec_minmax() insteadYong Zhang
sched_autogroup_enabled has min/max value, proc_dointvec_minmax() is be used for this case. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1298185696-4403-2-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16Merge commit 'v2.6.38-rc5' into sched/coreIngo Molnar
Merge reason: Pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16perf: Optimize throttling codePeter Zijlstra
By pre-computing the maximum number of samples per tick we can avoid a multiplication and a conditional since MAX_INTERRUPTS > max_samples_per_tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-03sched: Use a buddy to implement yield_task_fair()Rik van Riel
Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution. We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the "next" and "yield" task to be different processen in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the "yield" hint, and pick the correct entity once we get to the right level. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-01security: remove unused security_sysctl hookLucian Adrian Grijincu
The only user for this hook was selinux. sysctl routes every call through /proc/sys/. Selinux and other security modules use the file system checks for sysctl too, so no need for this hook any more. Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2011-01-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: wacom - pass touch resolution to clients through input_absinfo Input: wacom - add 2 Bamboo Pen and touch models Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup Input: tegra-kbc - add tegra keyboard driver Input: gpio_keys - switch to using request_any_context_irq Input: serio - allow registered drivers to get status flag Input: ct82710c - return proper error code for ct82c710_open Input: bu21013_ts - added regulator support Input: bu21013_ts - remove duplicate resolution parameters Input: tnetv107x-ts - don't treat NULL clk as an error Input: tnetv107x-keypad - don't treat NULL clk as an error Fix up trivial conflicts in drivers/input/keyboard/Makefile due to additions of tc3589x/Tegra drivers
2011-01-24Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistentAndy Whitcroft
Currently sysrq_enabled and __sysrq_enabled are initialised separately and inconsistently, leading to sysrq being actually enabled by reported as not enabled in sysfs. The first change to the sysfs configurable synchronises these two: static int __read_mostly sysrq_enabled = 1; static int __sysrq_enabled; Add a common define to carry the default for these preventing them becoming out of sync again. Default this to 1 to mirror previous behaviour. Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: stable@kernel.org Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2011-01-13sysctl: remove obsolete commentsJovi Zhang
ctl_unnumbered.txt have been removed in Documentation directory so just also remove this invalid comments [akpm@linux-foundation.org: fix Documentation/sysctl/00-INDEX, per Dave] Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Cc: Dave Young <hidave.darkstar@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13sysctl: fix #ifdef guard commentJovi Zhang
Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13kptr_restrict for hiding kernel pointers from unprivileged usersDan Rosenberg
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict sysctl. The %pK format specifier is designed to hide exposed kernel pointers, specifically via /proc interfaces. Exposing these pointers provides an easy target for kernel write vulnerabilities, since they reveal the locations of writable structures containing easily triggerable function pointers. The behavior of %pK depends on the kptr_restrict sysctl. If kptr_restrict is set to 0, no deviation from the standard %p behavior occurs. If kptr_restrict is set to 1, the default, if the current user (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG (currently in the LSM tree), kernel pointers using %pK are printed as 0's. If kptr_restrict is set to 2, kernel pointers using %pK are printed as 0's regardless of privileges. Replacing with 0's was chosen over the default "(null)", which cannot be parsed by userland %p, which expects "(nil)". [akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/] [akpm@linux-foundation.org: coding-style fixup] [randy.dunlap@oracle.com: fix kernel/sysctl.c warning] Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: James Morris <jmorris@namei.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Graf <tgraf@infradead.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Kees Cook <kees.cook@canonical.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-06Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) sched: Change wait_for_completion_*_timeout() to return a signed long sched, autogroup: Fix reference leak sched, autogroup: Fix potential access to freed memory sched: Remove redundant CONFIG_CGROUP_SCHED ifdef sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight sched: Move periodic share updates to entity_tick() printk: Use this_cpu_{read|write} api on printk_pending sched: Make pushable_tasks CONFIG_SMP dependant sched: Add 'autogroup' scheduling feature: automated per session task groups sched: Fix unregister_fair_sched_group() sched: Remove unused argument dest_cpu to migrate_task() mutexes, sched: Introduce arch_mutex_cpu_relax() sched: Add some clock info to sched_debug cpu: Remove incorrect BUG_ON cpu: Remove unused variable sched: Fix UP build breakage sched: Make task dump print all 15 chars of proc comm sched: Update tg->shares after cpu.shares write sched: Allow update_cfs_load() to update global load sched: Implement demand based update_cfs_load() ...
2010-12-09x86, NMI: Add back unknown_nmi_panic and nmi_watchdog sysctlsDon Zickus
Originally adapted from Huang Ying's patch which moved the unknown_nmi_panic to the traps.c file. Because the old nmi watchdog was deleted before this change happened, the unknown_nmi_panic sysctl was lost. This re-adds it. Also, the nmi_watchdog sysctl was re-implemented and its documentation updated accordingly. Patch-inspired-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: fweisbec@gmail.com LKML-Reference: <1291068437-5331-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30sched: Add 'autogroup' scheduling feature: automated per session task groupsMike Galbraith
A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > /proc/sys/kernel/sched_autogroup_enabled ... which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge commit 'v2.6.37-rc3' into sched/coreIngo Molnar
Merge reason: Pick up latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-26Merge commit 'v2.6.37-rc3' into perf/coreIngo Molnar
Merge reason: Pick up latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Add sysctl_sched_shares_windowPaul Turner
Introduce a new sysctl for the shares window and disambiguate it from sched_time_avg. A 10ms window appears to be a good compromise between accuracy and performance. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234938.112173964@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18sched: Rewrite tg_shares_up)Peter Zijlstra
By tracking a per-cpu load-avg for each cfs_rq and folding it into a global task_group load on each tick we can rework tg_shares_up to be strictly per-cpu. This should improve cpu-cgroup performance for smp systems significantly. [ Paul: changed to use queueing cfs_rq + bug fixes ] Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101115234937.580480400@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-18x86, nmi_watchdog: Remove the old nmi_watchdogDon Zickus
Now that we have a new nmi_watchdog that is more generic and sits on top of the perf subsystem, we really do not need the old nmi_watchdog any more. In addition, the old nmi_watchdog doesn't really work if you are using the default clocksource, hpet. The old nmi_watchdog code relied on local apic interrupts to determine if the cpu is still alive. With hpet as the clocksource, these interrupts don't increment any more and the old nmi_watchdog triggers false postives. This piece removes the old nmi_watchdog code and stubs out any variables and functions calls. The stubs are the same ones used by the new nmi_watchdog code, so it should be well tested. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: fweisbec@gmail.com Cc: gorcunov@openvz.org LKML-Reference: <1289578944-28564-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-16kernel/sysctl.c: Fix build failure with !CONFIG_PRINTKJoe Perches
Sigh... Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12Restrict unprivileged access to kernel syslogDan Rosenberg
The kernel syslog contains debugging information that is often useful during exploitation of other vulnerabilities, such as kernel heap addresses. Rather than futilely attempt to sanitize hundreds (or thousands) of printk statements and simultaneously cripple useful debugging functionality, it is far simpler to create an option that prevents unprivileged users from reading the syslog. This patch, loosely based on grsecurity's GRKERNSEC_DMESG, creates the dmesg_restrict sysctl. When set to "0", the default, no restrictions are enforced. When set to "1", only users with CAP_SYS_ADMIN can read the kernel syslog via dmesg(8) or other mechanisms. [akpm@linux-foundation.org: explain the config option in kernel.txt] Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Eugene Teo <eugeneteo@kernel.org> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) split invalidate_inodes() fs: skip I_FREEING inodes in writeback_sb_inodes fs: fold invalidate_list into invalidate_inodes fs: do not drop inode_lock in dispose_list fs: inode split IO and LRU lists fs: switch bdev inode bdi's correctly fs: fix buffer invalidation in invalidate_list fsnotify: use dget_parent smbfs: use dget_parent exportfs: use dget_parent fs: use RCU read side protection in d_validate fs: clean up dentry lru modification fs: split __shrink_dcache_sb fs: improve DCACHE_REFERENCED usage fs: use percpu counter for nr_dentry and nr_dentry_unused fs: simplify __d_free fs: take dcache_lock inside __d_path fs: do not assign default i_ino in new_inode fs: introduce a per-cpu last_ino allocator new helper: ihold() ...
2010-10-26printk: declare printk_ratelimit_state in ratelimit.hNamhyung Kim
Adding declaration of printk_ratelimit_state in ratelimit.h removes potential build breakage and following sparse warning: kernel/printk.c:1426:1: warning: symbol 'printk_ratelimit_state' was not declared. Should it be static? [akpm@linux-foundation.org: remove unneeded ifdef] Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26fs: allow for more than 2^31 filesEric Dumazet
Robin Holt tried to boot a 16TB system and found af_unix was overflowing a 32bit value : <quote> We were seeing a failure which prevented boot. The kernel was incapable of creating either a named pipe or unix domain socket. This comes down to a common kernel function called unix_create1() which does: atomic_inc(&unix_nr_socks); if (atomic_read(&unix_nr_socks) > 2 * get_max_files()) goto out; The function get_max_files() is a simple return of files_stat.max_files. files_stat.max_files is a signed integer and is computed in fs/file_table.c's files_init(). n = (mempages * (PAGE_SIZE / 1024)) / 10; files_stat.max_files = n; In our case, mempages (total_ram_pages) is approx 3,758,096,384 (0xe0000000). That leaves max_files at approximately 1,503,238,553. This causes 2 * get_max_files() to integer overflow. </quote> Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long integers, and change af_unix to use an atomic_long_t instead of atomic_t. get_max_files() is changed to return an unsigned long. get_nr_files() is changed to return a long. unix_nr_socks is changed from atomic_t to atomic_long_t, while not strictly needed to address Robin problem. Before patch (on a 64bit kernel) : # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max -18446744071562067968 After patch: # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max 2147483648 # cat /proc/sys/fs/file-nr 704 0 2147483648 Reported-by: Robin Holt <holt@sgi.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Reviewed-by: Robin Holt <holt@sgi.com> Tested-by: Robin Holt <holt@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26fs: use percpu counter for nr_dentry and nr_dentry_unusedChristoph Hellwig
The nr_dentry stat is a globally touched cacheline and atomic operation twice over the lifetime of a dentry. It is used for the benfit of userspace only. Turn it into a per-cpu counter and always decrement it in d_free instead of doing various batching operations to reduce lock hold times in the callers. Based on an earlier patch from Nick Piggin <npiggin@suse.de>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-26fs: Convert nr_inodes and nr_unused to per-cpu countersDave Chinner
The number of inodes allocated does not need to be tied to the addition or removal of an inode to/from a list. If we are not tied to a list lock, we could update the counters when inodes are initialised or destroyed, but to do that we need to convert the counters to be per-cpu (i.e. independent of a lock). This means that we have the freedom to change the list/locking implementation without needing to care about the counters. Based on a patch originally from Eric Dumazet. [AV: cleaned up a bit, fixed build breakage on weird configs Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-26fs: allow for more than 2^31 filesEric Dumazet
Andrew, Could you please review this patch, you probably are the right guy to take it, because it crosses fs and net trees. Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt depend on previous patch (sysctl: fix min/max handling in __do_proc_doulongvec_minmax()) Thanks ! [PATCH V4] fs: allow for more than 2^31 files Robin Holt tried to boot a 16TB system and found af_unix was overflowing a 32bit value : <quote> We were seeing a failure which prevented boot. The kernel was incapable of creating either a named pipe or unix domain socket. This comes down to a common kernel function called unix_create1() which does: atomic_inc(&unix_nr_socks); if (atomic_read(&unix_nr_socks) > 2 * get_max_files()) goto out; The function get_max_files() is a simple return of files_stat.max_files. files_stat.max_files is a signed integer and is computed in fs/file_table.c's files_init(). n = (mempages * (PAGE_SIZE / 1024)) / 10; files_stat.max_files = n; In our case, mempages (total_ram_pages) is approx 3,758,096,384 (0xe0000000). That leaves max_files at approximately 1,503,238,553. This causes 2 * get_max_files() to integer overflow. </quote> Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long integers, and change af_unix to use an atomic_long_t instead of atomic_t. get_max_files() is changed to return an unsigned long. get_nr_files() is changed to return a long. unix_nr_socks is changed from atomic_t to atomic_long_t, while not strictly needed to address Robin problem. Before patch (on a 64bit kernel) : # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max -18446744071562067968 After patch: # echo 2147483648 >/proc/sys/fs/file-max # cat /proc/sys/fs/file-max 2147483648 # cat /proc/sys/fs/file-nr 704 0 2147483648 Reported-by: Robin Holt <holt@sgi.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Reviewed-by: Robin Holt <holt@sgi.com> Tested-by: Robin Holt <holt@sgi.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-07sysctl: fix min/max handling in __do_proc_doulongvec_minmax()Eric Dumazet
When proc_doulongvec_minmax() is used with an array of longs, and no min/max check requested (.extra1 or .extra2 being NULL), we dereference a NULL pointer for the second element of the array. Noticed while doing some changes in network stack for the "16TB problem" Fix is to not change min & max pointers in __do_proc_doulongvec_minmax(), so that all elements of the vector share an unique min/max limit, like proc_dointvec_minmax(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-05gcc-4.6: kernel/*: Fix unused but set warningsAndi Kleen
No real bugs I believe, just some dead code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: andi@firstfloor.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-08-10Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notifyLinus Torvalds
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits) fanotify: use both marks when possible fsnotify: pass both the vfsmount mark and inode mark fsnotify: walk the inode and vfsmount lists simultaneously fsnotify: rework ignored mark flushing fsnotify: remove global fsnotify groups lists fsnotify: remove group->mask fsnotify: remove the global masks fsnotify: cleanup should_send_event fanotify: use the mark in handler functions audit: use the mark in handler functions dnotify: use the mark in handler functions inotify: use the mark in handler functions fsnotify: send fsnotify_mark to groups in event handling functions fsnotify: Exchange list heads instead of moving elements fsnotify: srcu to protect read side of inode and vfsmount locks fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called fsnotify: use _rcu functions for mark list traversal fsnotify: place marks on object in order of group memory address vfs/fsnotify: fsnotify_close can delay the final work in fput fsnotify: store struct file not struct path ... Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
2010-08-10oom: move sysctl declarations to oom.hDavid Rientjes
The three oom killer sysctl variables (sysctl_oom_dump_tasks, sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better declared in include/linux/oom.h rather than kernel/sysctl.c. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-07Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits) workqueue: mark init_workqueues() as early_initcall() workqueue: explain for_each_*cwq_cpu() iterators fscache: fix build on !CONFIG_SYSCTL slow-work: kill it gfs2: use workqueue instead of slow-work drm: use workqueue instead of slow-work cifs: use workqueue instead of slow-work fscache: drop references to slow-work fscache: convert operation to use workqueue instead of slow-work fscache: convert object to use workqueue instead of slow-work workqueue: fix how cpu number is stored in work->data workqueue: fix mayday_mask handling on UP workqueue: fix build problem on !CONFIG_SMP workqueue: fix locking in retry path of maybe_create_worker() async: use workqueue for worker pool workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead workqueue: implement unbound workqueue workqueue: prepare for WQ_UNBOUND implementation libata: take advantage of cmwq and remove concurrency limitations workqueue: fix worker management invocation without pending works ... Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
2010-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (28 commits) driver core: device_rename's new_name can be const sysfs: Remove owner field from sysfs struct attribute powerpc/pci: Remove owner field from attribute initialization in PCI bridge init regulator: Remove owner field from attribute initialization in regulator core driver leds: Remove owner field from attribute initialization in bd2802 driver scsi: Remove owner field from attribute initialization in ARCMSR driver scsi: Remove owner field from attribute initialization in LPFC driver cgroupfs: create /sys/fs/cgroup to mount cgroupfs on Driver core: Add BUS_NOTIFY_BIND_DRIVER driver core: fix memory leak on one error path in bus_register() debugfs: no longer needs to depend on SYSFS sysfs: Fix one more signature discrepancy between sysfs implementation and docs. sysfs: fix discrepancies between implementation and documentation dcdbas: remove a redundant smi_data_buf_free in dcdbas_exit dmi-id: fix a memory leak in dmi_id_init error path sysfs: sysfs_chmod_file's attr can be const firmware: Update hotplug script Driver core: move platform device creation helpers to .init.text (if MODULE=n) Driver core: reduce duplicated code for platform_device creation Driver core: use kmemdup in platform_device_add_resources ...
2010-08-05hotplug: Support kernel/hotplug sysctl variable when !CONFIG_NETIan Abbott
The kernel/hotplug sysctl variable (/proc/sys/kernel/hotplug file) was made conditional on CONFIG_NET by commit f743ca5e10f4145e0b3e6d11b9b46171e16af7ce (applied in 2.6.18) to fix problems with undefined references in 2.6.16 when CONFIG_HOTPLUG=y && !CONFIG_NET, but this restriction is no longer needed. This patch makes the kernel/hotplug sysctl variable depend only on CONFIG_HOTPLUG. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Randy Dunlap <randy.dunlap@oracle.COM> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-05Merge branch 'perf/nmi' into perf/coreIngo Molnar
Conflicts: kernel/Makefile Merge reason: Add the now complete topic, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-28sysctl extern cleanup: inotifyDave Young
Extern declarations in sysctl.c should be move to their own head file, and then include them in relavant .c files. Move inotify_table extern declaration to linux/inotify.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28dnotify: move dir_notify_enable declarationAlexey Dobriyan
Move dir_notify_enable declaration to where it belongs -- dnotify.h . Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-23slow-work: kill itTejun Heo
slow-work doesn't have any user left. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com>
2010-06-03pipe: change /proc/sys/fs/pipe-max-pages to byte sized interfaceJens Axboe
This changes the interface to be based on bytes instead. The API matches that of F_SETPIPE_SZ in that it rounds up the passed in size so that the resulting page array is a power-of-2 in size. The proc file is renamed to /proc/sys/fs/pipe-max-size to reflect this change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits) drivers/net/usb/asix.c: Fix pointer cast. be2net: Bug fix to avoid disabling bottom half during firmware upgrade. proc_dointvec: write a single value hso: add support for new products Phonet: fix potential use-after-free in pep_sock_close() ath9k: remove VEOL support for ad-hoc ath9k: change beacon allocation to prefer the first beacon slot sock.h: fix kernel-doc warning cls_cgroup: Fix build error when built-in macvlan: do proper cleanup in macvlan_common_newlink() V2 be2net: Bug fix in init code in probe net/dccp: expansion of error code size ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep wireless: fix sta_info.h kernel-doc warnings wireless: fix mac80211.h kernel-doc warnings iwlwifi: testing the wrong variable in iwl_add_bssid_station() ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs() ath9k_htc: dereferencing before check in hif_usb_tx_cb() rt2x00: Fix rt2800usb TX descriptor writing. rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions. ...
2010-05-25proc_dointvec: write a single valueJ. R. Okajima
The commit 00b7c3395aec3df43de5bd02a3c5a099ca51169f "sysctl: refactor integer handling proc code" modified the behaviour of writing to /proc. Before the commit, write("1\n") to /proc/sys/kernel/printk succeeded. But now it returns EINVAL. This commit supports writing a single value to a multi-valued entry. Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Reviewed-and-tested-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-25mm: compaction: add a tunable that decides when memory should be compacted ↵Mel Gorman
and when it should be reclaimed The kernel applies some heuristics when deciding if memory should be compacted or reclaimed to satisfy a high-order allocation. One of these is based on the fragmentation. If the index is below 500, memory will not be compacted. This choice is arbitrary and not based on data. To help optimise the system and set a sensible default for this value, this patch adds a sysctl extfrag_threshold. The kernel will only compact memory if the fragmentation index is above the extfrag_threshold. [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25mm: compaction: add /proc trigger for memory compactionMel Gorman
Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is written to the file, all zones are compacted. The expected user of such a trigger is a job scheduler that prepares the system before the target application runs. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>