summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2011-05-28sched: Fix ->min_vruntime calculation in dequeue_entity()Peter Zijlstra
Dima Zavin <dima@android.com> reported: "After pulling the thread off the run-queue during a cgroup change, the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime then gets normalized to this new value. This can then lead to the thread getting an unfair boost in the new group if the vruntime of the next task in the old run-queue was way further ahead." Reported-by: Dima Zavin <dima@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSWPeter Zijlstra
Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu() code was suspect. Yong found that the interrupt could hit after context_switch() changes current but before it clears p->on_cpu, if that interrupt were to attempt a wake-up of p we would indeed find ourselves spinning in IRQ context. Fix this by reverting to the old behaviour for this situation and perform a full remote wake-up. Cc: Frank Rowand <frank.rowand@am.sony.com> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Reported-by: Marc Zyngier <Marc.Zyngier@arm.com> Tested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28sched: More sched_domain iterations fixesXiaotian Feng
sched_domain iterations needs to be protected by rcu_read_lock() now, this patch adds another two places which needs the rcu lock, which is spotted by following suspicious rcu_dereference_check() usage warnings. kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection! kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection! Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Fix PM QOS's user mode interface to work with ASCII input PM / Hibernate: Update kerneldoc comments in hibernate.c PM / Hibernate: Remove arch_prepare_suspend() PM / Hibernate: Update some comments in core hibernate code
2011-05-27Merge branch 'docs-move' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs * 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs: Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file> to Documentation/security/<moved_file>.
2011-05-27Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent
2011-05-27kernel/profile.c: remove some duplicate code from profile_hits()Rakib Mullick
profile_hits() has a common check for prof_on and prof_buffer regardless of SMP or !SMP. So, remove some duplicate code by splitting profile_hits into two. [akpm@linux-foundation.org: make do_profile_hits static] Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27mm: extract exe_file handling from procfsJiri Slaby
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/. This was because exe_file was needed only for /proc/<pid>/exe. Since we will need the exe_file functionality also for core dumps (so core name can contain full binary path), built this functionality always into the kernel. To achieve that move that out of proc FS to the kernel/ where in fact it should belong. By doing that we can make dup_mm_exe_file static. Also we can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27cgroup: remove the ns_cgroupDaniel Lezcano
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems: * cgroup creation is out-of-control * cgroup name can conflict when pids are looping * it is not possible to have a single process handling a lot of namespaces without falling in a exponential creation time * we may want to create a namespace without creating a cgroup The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file. This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html The 'cgroup_clone' function is removed because it is no longer used. This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27cgroups: use flex_array in attach_procBen Blum
Convert cgroup_attach_proc to use flex_array. The cgroup_attach_proc implementation requires a pre-allocated array to store task pointers to atomically move a thread-group, but asking for a monolithic array with kmalloc() may be unreliable for very large groups. Using flex_array provides the same functionality with less risk of failure. This is a post-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27cgroups: make procs file writableBen Blum
Make procs file writable to move all threads by tgid at once. Add functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from "escaping" while the move is in progress. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27cgroups: add per-thread subsystem callbacksBen Blum
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-27cgroups: read-write lock CLONE_THREAD forking per threadgroupBen Blum
Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Add an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on. This is a pre-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26PM: Fix PM QOS's user mode interface to work with ASCII inputRafael J. Wysocki
Make pm_qos_power_write() accept values passed to it in the ASCII hex format either with or without an ending newline. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Mark Gross <markgross@thegnar.org>
2011-05-26rcu: Decrease memory-barrier usage based on semi-formal proofPaul E. McKenney
(Note: this was reverted, and is now being re-applied in pieces, with this being the fifth and final piece. See below for the reason that it is now felt to be safe to re-apply this.) Commit d09b62d fixed grace-period synchronization, but left some smp_mb() invocations in rcu_process_callbacks() that are no longer needed, but sheer paranoia prevented them from being removed. This commit removes them and provides a proof of correctness in their absence. It also adds a memory barrier to rcu_report_qs_rsp() immediately before the update to rsp->completed in order to handle the theoretical possibility that the compiler or CPU might move massive quantities of code into a lock-based critical section. This also proves that the sheer paranoia was not entirely unjustified, at least from a theoretical point of view. In addition, the old dyntick-idle synchronization depended on the fact that grace periods were many milliseconds in duration, so that it could be assumed that no dyntick-idle CPU could reorder a memory reference across an entire grace period. Unfortunately for this design, the addition of expedited grace periods breaks this assumption, which has the unfortunate side-effect of requiring atomic operations in the functions that track dyntick-idle state for RCU. (There is some hope that the algorithms used in user-level RCU might be applied here, but some work is required to handle the NMIs that user-space applications can happily ignore. For the short term, better safe than sorry.) This proof assumes that neither compiler nor CPU will allow a lock acquisition and release to be reordered, as doing so can result in deadlock. The proof is as follows: 1. A given CPU declares a quiescent state under the protection of its leaf rcu_node's lock. 2. If there is more than one level of rcu_node hierarchy, the last CPU to declare a quiescent state will also acquire the ->lock of the next rcu_node up in the hierarchy, but only after releasing the lower level's lock. The acquisition of this lock clearly cannot occur prior to the acquisition of the leaf node's lock. 3. Step 2 repeats until we reach the root rcu_node structure. Please note again that only one lock is held at a time through this process. The acquisition of the root rcu_node's ->lock must occur after the release of that of the leaf rcu_node. 4. At this point, we set the ->completed field in the rcu_state structure in rcu_report_qs_rsp(). However, if the rcu_node hierarchy contains only one rcu_node, then in theory the code preceding the quiescent state could leak into the critical section. We therefore precede the update of ->completed with a memory barrier. All CPUs will therefore agree that any updates preceding any report of a quiescent state will have happened before the update of ->completed. 5. Regardless of whether a new grace period is needed, rcu_start_gp() will propagate the new value of ->completed to all of the leaf rcu_node structures, under the protection of each rcu_node's ->lock. If a new grace period is needed immediately, this propagation will occur in the same critical section that ->completed was set in, but courtesy of the memory barrier in #4 above, is still seen to follow any pre-quiescent-state activity. 6. When a given CPU invokes __rcu_process_gp_end(), it becomes aware of the end of the old grace period and therefore makes any RCU callbacks that were waiting on that grace period eligible for invocation. If this CPU is the same one that detected the end of the grace period, and if there is but a single rcu_node in the hierarchy, we will still be in the single critical section. In this case, the memory barrier in step #4 guarantees that all callbacks will be seen to execute after each CPU's quiescent state. On the other hand, if this is a different CPU, it will acquire the leaf rcu_node's ->lock, and will again be serialized after each CPU's quiescent state for the old grace period. On the strength of this proof, this commit therefore removes the memory barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp(). The effect is to reduce the number of memory barriers by one and to reduce the frequency of execution from about once per scheduling tick per CPU to once per grace period. This was reverted do to hangs found during testing by Yinghai Lu and Ingo Molnar. Frederic Weisbecker supplied Yinghai with tracing that located the underlying problem, and Frederic also provided the fix. The underlying problem was that the HARDIRQ_ENTER() macro from lib/locking-selftest.c invoked irq_enter(), which in turn invokes rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which does not invoke rcu_irq_exit(). This situation resulted in calls to rcu_irq_enter() that were not balanced by the required calls to rcu_irq_exit(). Therefore, after these locking selftests completed, RCU's dyntick-idle nesting count was a large number (for example, 72), which caused RCU to to conclude that the affected CPU was not in dyntick-idle mode when in fact it was. RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting in hangs. In contrast, with Frederic's patch, which replaces the irq_enter() in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU running the test is already marked as not being in dyntick-idle mode. This means that the rcu_irq_enter() and rcu_irq_exit() calls and RCU then has no problem working out which CPUs are in dyntick-idle mode and which are not. The reason that the imbalance was not noticed before the barrier patch was applied is that the old implementation of rcu_enter_nohz() ignored the nesting depth. This could still result in delays, but much shorter ones. Whenever there was a delay, RCU would IPI the CPU with the unbalanced nesting level, which would eventually result in rcu_enter_nohz() being called, which in turn would force RCU to see that the CPU was in dyntick-idle mode. The reason that very few people noticed the problem is that the mismatched irq_enter() vs. __irq_exit() occured only when the kernel was built with CONFIG_DEBUG_LOCKING_API_SELFTESTS. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-26rcu: Make rcu_enter_nohz() pay attention to nestingPaul E. McKenney
The old version of rcu_enter_nohz() forced RCU into nohz mode even if the nesting count was non-zero. This change causes rcu_enter_nohz() to hold off for non-zero nesting counts. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-26rcu: Don't do reschedule unless in irqPaul E. McKenney
Condition the set_need_resched() in rcu_irq_exit() on in_irq(). This should be a no-op, because rcu_irq_exit() should only be called from irq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-26rcu: Remove old memory barriers from rcu_process_callbacks()Paul E. McKenney
Second step of partitioning of commit e59fb3120b. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-26rcu: Add memory barriersPaul E. McKenney
Add the memory barriers added by e59fb3120b. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-26Merge branch 'linus' into perf/urgentIngo Molnar
Merge reason: Linus applied an overlapping commit: 5f2e8e2b0bf0: kernel/watchdog.c: Use proper ANSI C prototypes So merge it in to make sure we can iterate the file without conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-26irq: Remove smp_affinity_list when unregister irq procYinghai Lu
commit 4b06042(bitmap, irq: add smp_affinity_list interface to /proc/irq) causes the following warning: [ 274.239500] WARNING: at fs/proc/generic.c:850 remove_proc_entry+0x24c/0x27a() [ 274.251761] remove_proc_entry: removing non-empty directory 'irq/184', leaking at least 'smp_affinity_list' Remove the new file in the exit path. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Mike Travis <travis@sgi.com> Link: http://lkml.kernel.org/r/4DDDE094.6050505@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-26ftrace: Add internal recursive checksSteven Rostedt
Witold reported a reboot caused by the selftests of the dynamic function tracer. He sent me a config and I used ktest to do a config_bisect on it (as my config did not cause the crash). It pointed out that the problem config was CONFIG_PROVE_RCU. What happened was that if multiple callbacks are attached to the function tracer, we iterate a list of callbacks. Because the list is managed by synchronize_sched() and preempt_disable, the access to the pointers uses rcu_dereference_raw(). When PROVE_RCU is enabled, the rcu_dereference_raw() calls some debugging functions, which happen to be traced. The tracing of the debug function would then call rcu_dereference_raw() which would then call the debug function and then... well you get the idea. I first wrote two different patches to solve this bug. 1) add a __rcu_dereference_raw() that would not do any checks. 2) add notrace to the offending debug functions. Both of these patches worked. Talking with Paul McKenney on IRC, he suggested to add recursion detection instead. This seemed to be a better solution, so I decided to implement it. As the task_struct already has a trace_recursion to detect recursion in the ring buffer, and that has a very small number it allows, I decided to use that same variable to add flags that can detect the recursion inside the infrastructure of the function tracer. I plan to change it so that the task struct bit can be checked in mcount, but as that requires changes to all archs, I will hold that off to the next merge window. Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-26tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machineliubo
Filesystem, like Btrfs, has some "ULL" macros, and when these macros are passed to tracepoints'__print_symbolic(), there will be 64->32 truncate WARNINGS during compiling on 32bit box. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-26ftrace: Set ops->flag to enabled even on static function tracingSteven Rostedt
When dynamic ftrace is not configured, the ops->flags still needs to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-26tracing: Have event with function tracer check error returnSteven Rostedt
The self tests for event tracer does not check if the function tracing was successfully activated. It needs to before it continues the tests, otherwise the wrong errors may be reported. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-26ftrace: Have ftrace_startup() return failure codeSteven Rostedt
The register_ftrace_function() returns an error code on failure except if the call to ftrace_startup() fails. Add a error return to ftrace_startup() if it fails to start, allowing register_ftrace_funtion() to return a proper error value. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfdLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd: net: fix get_net_ns_by_fd for !CONFIG_NET_NS ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry. ns: Declare sys_setns in syscalls.h net: Allow setting the network namespace by fd ns proc: Add support for the ipc namespace ns proc: Add support for the uts namespace ns proc: Add support for the network namespace. ns: Introduce the setns syscall ns: proc files for namespace naming policy.
2011-05-25jump_label: Check entries limit in __jump_label_updateJiri Olsa
When iterating the jump_label entries array (core or modules), the __jump_label_update function peeks over the last entry. The reason is that the end of the for loop depends on the key value of the processed entry. Thus when going through the last array entry, we will touch the memory behind the array limit. This bug probably will never be triggered, since most likely the memory behind the jump_label entries will be accesable and the entry->key will be different than the expected value. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Jason Baron <jbaron@redhat.com> Link: http://lkml.kernel.org/r/20110510104346.GC1899@jolsa.brq.redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-05-25Merge branch 'for-2.6.40' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: signal: sys_pause() should check signal_pending() ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread
2011-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: (26 commits) arch/tile: prefer "tilepro" as the name of the 32-bit architecture compat: include aio_abi.h for aio_context_t arch/tile: cleanups for tilegx compat mode arch/tile: allocate PCI IRQs later in boot arch/tile: support signal "exception-trace" hook arch/tile: use better definitions of xchg() and cmpxchg() include/linux/compat.h: coding-style fixes tile: add an RTC driver for the Tilera hypervisor arch/tile: finish enabling support for TILE-Gx 64-bit chip compat: fixes to allow working with tile arch arch/tile: update defconfig file to something more useful tile: do_hardwall_trap: do not play with task->sighand tile: replace mm->cpu_vm_mask with mm_cpumask() tile,mn10300: add device parameter to dma_cache_sync() audit: support the "standard" <asm-generic/unistd.h> arch/tile: clarify flush_buffer()/finv_buffer() function names arch/tile: kernel-related cleanups from removing static page size arch/tile: various header improvements for building drivers arch/tile: disable GX prefetcher during cache flush arch/tile: tolerate disabling CONFIG_BLK_DEV_INITRD ...
2011-05-25hrtimers: Fix typo causing erratic timersThomas Gleixner
commit 9ec2690758a5 ("timerfd: Manage cancelable timers in timerfd") introduced a CONFIG_HIGHRES_TIMERS (should be CONFIG_HIGH_RES_TIMERS) typo, which caused applications depending on CLOCK_REALTIME timers to become sluggy due to the fact that the time base of the realtime timers was not updated when the wall clock time was set. This causes anything from 100% CPU use for some applications to odd delays and hickups. Reported-bisected-and-tested-by: Anca Emanuel <anca.emanuel@gmail.com> Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Fatfingered-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25signal: sys_pause() should check signal_pending()Oleg Nesterov
ERESTART* is always wrong without TIF_SIGPENDING. Teach sys_pause() to handle the spurious wakeup correctly. Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-05-25ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED threadOleg Nesterov
It is not clear why ptrace_resume() does wake_up_process(). Unless the caller is PTRACE_KILL the tracee should be TASK_TRACED so we can use wake_up_state(__TASK_TRACED). If sys_ptrace() races with SIGKILL we do not need the extra and potentionally spurious wakeup. If the caller is PTRACE_KILL, wake_up_process() is even more wrong. The tracee can sleep in any state in any place, and if we have a buggy code which doesn't handle a spurious wakeup correctly PTRACE_KILL can be used to exploit it. For example: int main(void) { int child, status; child = fork(); if (!child) { int ret; assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); ret = pause(); printf("pause: %d %m\n", ret); return 0x23; } sleep(1); assert(ptrace(PTRACE_KILL, child, 0,0) == 0); assert(child == wait(&status)); printf("wait: %x\n", status); return 0; } prints "pause: -1 Unknown error 514", -ERESTARTNOHAND leaks to the userland. In this case sys_pause() is buggy as well and should be fixed. I do not know what was the original rationality behind PTRACE_KILL. The man page is simply wrong and afaics it was always wrong. Imho it should be deprecated, or may be it should do send_sig(SIGKILL) as Denys suggests, but in any case I do not think that the current behaviour was intentional. Note: there is another problem, ptrace_resume() changes ->exit_code and this can race with SIGKILL too. Eventually we should change ptrace to not use ->exit_code. Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-05-25Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: RCU conversion
2011-05-25printk: allocate kernel log buffer earlierMike Travis
On larger systems, because of the numerous ACPI, Bootmem and EFI messages, the static log buffer overflows before the larger one specified by the log_buf_len param is allocated. Minimize the overflow by allocating the new log buffer as soon as possible. On kernels without memblock, a later call to setup_log_buf from kernel/init.c is the fallback. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_PRINTK=n build] Signed-off-by: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25bitmap, irq: add smp_affinity_list interface to /proc/irqMike Travis
Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the cpu count is large. Setting smp affinity to cpus 256 to 263 would be: echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity instead of: echo 256-263 > smp_affinity_list Think about what it looks like for cpus around say, 4088 to 4095. We already have many alternate "list" interfaces: /sys/devices/system/cpu/cpuX/indexY/shared_cpu_list /sys/devices/system/cpu/cpuX/topology/thread_siblings_list /sys/devices/system/cpu/cpuX/topology/core_siblings_list /sys/devices/system/node/nodeX/cpulist /sys/devices/pci***/***/local_cpulist Add a companion interface, smp_affinity_list to use cpu lists instead of cpu maps. This conforms to other companion interfaces where both a map and a list interface exists. This required adding a bitmap_parselist_user() function in a manner similar to the bitmap_parse_user() function. [akpm@linux-foundation.org: make __bitmap_parselist() static] Signed-off-by: Mike Travis <travis@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jack Steiner <steiner@sgi.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: convert mm->cpu_vm_cpumask into cpumask_var_tKOSAKI Motohiro
cpumask_t is very big struct and cpu_vm_mask is placed wrong position. It might lead to reduce cache hit ratio. This patch has two change. 1) Move the place of cpumask into last of mm_struct. Because usually cpumask is accessed only front bits when the system has cpu-hotplug capability 2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory footprint if cpumask_size() will use nr_cpumask_bits properly in future. In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var. It may help to detect out of tree cpu_vm_mask users. This patch has no functional change. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Howells <dhowells@redhat.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: Convert i_mmap_lock to a mutexPeter Zijlstra
Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: Remove i_mmap_lock lockbreakPeter Zijlstra
Hugh says: "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? " Counter points: - cpu contention makes the spin stop (need_resched()) - zap pages should be freeing pages at a higher rate than reclaim ever can I think the simplification of the truncate code is definitely worth it. Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25lockdep, mutex: provide mutex_lock_nest_lockPeter Zijlstra
In order to convert i_mmap_lock to a mutex we need a mutex equivalent to spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation. As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of the locking pattern where an outer lock serializes the acquisition order of nested locks. That is, if every time you lock multiple locks A, say A1 and A2 you first acquire N, the order of acquiring A1 and A2 is irrelevant. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-24PM / Hibernate: Update kerneldoc comments in hibernate.cRafael J. Wysocki
Some of the kerneldoc comments in kernel/power/hibernate.c are outdated and some of them don't adhere to the kernel's standards. Update them and make them look in a consistent way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
2011-05-24PM / Hibernate: Remove arch_prepare_suspend()Rafael J. Wysocki
All architectures supporting hibernation define arch_prepare_suspend() as an empty function, so remove it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits) TOMOYO: Fix wrong domainname validation. SELINUX: add /sys/fs/selinux mount point to put selinuxfs CRED: Fix load_flat_shared_library() to initialise bprm correctly SELinux: introduce path_has_perm flex_array: allow 0 length elements flex_arrays: allow zero length flex arrays flex_array: flex_array_prealloc takes a number of elements, not an end SELinux: pass last path component in may_create SELinux: put name based create rules in a hashtable SELinux: generic hashtab entry counter SELinux: calculate and print hashtab stats with a generic function SELinux: skip filename trans rules if ttype does not match parent dir SELinux: rename filename_compute_type argument to *type instead of *con SELinux: fix comment to state filename_compute_type takes an objname not a qstr SMACK: smack_file_lock can use the struct path LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE SELINUX: Make selinux cache VFS RCU walks safe SECURITY: Move exec_permission RCU checks into security modules SELinux: security_read_policy should take a size_t not ssize_t ...
2011-05-24Merge branch 'for-2.6.40' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: Unify input section names percpu: Avoid extra NOP in percpu_cmpxchg16b_double percpu: Cast away printk format warning percpu: Always align percpu output section to PAGE_SIZE Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
2011-05-24Merge branch 'next' into for-linusJames Morris
2011-05-24posix-timers: RCU conversionEric Dumazet
Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard a single spinlock (idr_lock) in posix-timers code, on its 48 core machine. Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time used in ticket_spin_lock, from lock_timer Ref: http://www.spinics.net/lists/kvm/msg51526.html Switching to RCU is quite easy, IDR being already RCU ready. idr_lock should be locked only for an insert/delete, not a lookup. Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime(). Before : real 1m18.669s user 0m1.346s sys 1m17.180s After : real 0m3.296s user 0m1.366s sys 0m1.926s Reported-by: Ben Nagy <ben@iagu.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Ben Nagy <ben@iagu.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Richard Cochran <richard.cochran@omicron.at> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-24Merge branch 'fixes-2.6.39' into for-2.6.40Tejun Heo
2011-05-24Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix sample type size calculation in 32 bits archs profile: Use vzalloc() rather than vmalloc() & memset()
2011-05-24kernel/watchdog.c: Use proper ANSI C prototypesLinus Torvalds
We try to enforce it by using -Wstrict-prototypes, but apparently they sometimes get through. Introduced by 4eec42f39204 ("watchdog: Change the default timeout and configure nmi watchdog period based"). Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-24watchdog: Fix non-standard prototype of get_softlockup_thresh()Ingo Molnar
This build warning slipped through: kernel/watchdog.c:102: warning: function declaration isn't a prototype As reported by Stephen Rothwell. Also address an unused variable warning that GCC 4.6.0 reports: we cannot do anything about failed watchdog ops during CPU hotplug (it's not serious enough to return an error from the notifier), so ignore them. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20110524134129.8da27016.sfr@canb.auug.org.au Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071642.GF22305@elte.hu>