summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-02-17Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: doc: mmiotrace.txt, buffer size control change trace: mmiotrace to the tracer menu in Kconfig mmiotrace: count events lost due to not recording
2009-02-17Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: iommu: fix Intel IOMMU write-buffer flushing futex: fix reference leak Trivial conflicts fixed manually in drivers/pci/intel-iommu.c
2009-02-15trace: mmiotrace to the tracer menu in KconfigPekka Paalanen
Impact: cosmetic change in Kconfig menu layout This patch was originally suggested by Peter Zijlstra, but seems it was forgotten. CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable directly under the Kernel hacking / debugging menu in the kernel configuration system. They were present only for x86 and x86_64. Other tracers that use the ftrace tracing framework are in their own sub-menu. This patch moves the mmiotrace configuration options there. Since the Kconfig file, where the tracer menu is, is not architecture specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by x86/x86_64. CONFIG_MMIOTRACE now depends on it. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15mmiotrace: count events lost due to not recordingPekka Paalanen
Impact: enhances lost events counting in mmiotrace The tracing framework, or the ring buffer facility it uses, has a switch to stop recording data. When recording is off, the trace events will be lost. The framework does not count these, so mmiotrace has to count them itself. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13User namespaces: Only put the userns when we unhash the uidSerge E. Hallyn
uids in namespaces other than init don't get a sysfs entry. For those in the init namespace, while we're waiting to remove the sysfs entry for the uid the uid is still hashed, and alloc_uid() may re-grab that uid without getting a new reference to the user_ns, which we've already put in free_user before scheduling remove_user_sysfs_dir(). Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11cgroups: fix lockdep subclasses overflowLi Zefan
I enabled all cgroup subsystems when compiling kernel, and then: # mount -t cgroup -o net_cls xxx /mnt # mkdir /mnt/0 This showed up immediately: BUG: MAX_LOCKDEP_SUBCLASSES too low! turning off the locking correctness validator. It's caused by the cgroup hierarchy lock: for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { struct cgroup_subsys *ss = subsys[i]; if (ss->root == root) mutex_lock_nested(&ss->hierarchy_mutex, i); } Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but MAX_LOCKDEP_SUBCLASSES is 8. This patch uses different lockdep keys for different subsystems. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11mm: fix dirty_bytes/dirty_background_bytes sysctls on 64bit archesSven Wegener
We need to pass an unsigned long as the minimum, because it gets casted to an unsigned long in the sysctl handler. If we pass an int, we'll access four more bytes on 64bit arches, resulting in a random minimum value. [rientjes@google.com: fix type of `old_bytes'] Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11futex: fix reference leakPeter Zijlstra
Catalin noticed that (38d47c1b7075: futex: rely on get_user_pages() for shared futexes) caused an mm_struct leak. Some tracing with the function graph tracer quickly pointed out that futex_wait() has exit paths with unbalanced reference counts. This regression was discovered by kmemleak. Reported-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: revert recent sync wakeup changes
2009-02-11Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: fix TIMER_ABSTIME for process wide cpu timers timers: split process wide cpu clocks/timers, fix x86: clean up hpet timer reinit timers: split process wide cpu clocks/timers, remove spurious warning timers: split process wide cpu clocks/timers signal: re-add dead task accumulation stats. x86: fix hpet timer reinit for x86_64 sched: fix nohz load balancer on cpu offline
2009-02-11Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ptrace, x86: fix the usage of ptrace_fork() i8327: fix outb() parameter order x86: fix math_emu register frame access x86: math_emu info cleanup x86: include correct %gs in a.out core dump x86, vmi: put a missing paravirt_release_pmd in pgd_dtor x86: find nr_irqs_gsi with mp_ioapic_routing x86: add clflush before monitor for Intel 7400 series x86: disable intel_iommu support by default x86: don't apply __supported_pte_mask to non-present ptes x86: fix grammar in user-visible BIOS warning x86/Kconfig.cpu: make Kconfig help readable in the console x86, 64-bit: print DMI info in the oops trace
2009-02-11sched: revert recent sync wakeup changesPeter Zijlstra
Intel reported a 10% regression (mysql+sysbench) on a 16-way machine with these patches: 1596e29: sched: symmetric sync vs avg_overlap d942fb6: sched: fix sync wakeups Revert them. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Bisected-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11timers: fix TIMER_ABSTIME for process wide cpu timersPeter Zijlstra
The POSIX timer interface allows for absolute time expiry values through the TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock every time we start it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11timers: split process wide cpu clocks/timers, fixPeter Zijlstra
To decrease the chance of a missed enable, always enable the timer when we sample it, we'll always disable it when we find that there are no active timers in the jiffy tick. This fixes a flood of warnings reported by Mike Galbraith. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11ptrace, x86: fix the usage of ptrace_fork()Oleg Nesterov
I noticed by pure accident we have ptrace_fork() and friends. This was added by "x86, bts: add fork and exit handling", commit bf53de907dfdaac178c92d774aae7370d7b97d20. I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly believe this needs the fix. I think something like this program int main(void) { int pid = fork(); if (!pid) { ptrace(PTRACE_TRACEME, 0, NULL, NULL); kill(getpid(), SIGSTOP); fork(); } else { struct ptrace_bts_config bts = { .flags = PTRACE_BTS_O_ALLOC, .size = 4 * 4096, }; wait(NULL); ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK); ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts)); ptrace(PTRACE_CONT, pid, NULL, NULL); sleep(1); } return 0; } should crash the kernel. If the task is traced by its natural parent ptrace_reparented() returns 0 but we should clear ->btsxxx anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09profiling: fix broken profiling regressionHugh Dickins
Impact: fix broken /proc/profile on UP machines Commit c309b917cab55799ea489d7b5f1b77025d9f8462 "cpumask: convert kernel/profile.c" broke profiling. prof_cpu_mask was previously initialized to CPU_MASK_ALL, but left uninitialized in that commit. We need to copy cpu_possible_mask (cpu_online_mask is not enough). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-08async: use list_move_tailStefan Richter
list.h provides a dedicated primitive for "list_del followed by list_add_tail"... list_move_tail. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
2009-02-08async: Rename _special -> _domain for clarity.Cornelia Huck
Rename the async_*_special() functions to async_*_domain(), which describes the purpose of these functions much better. [Broke up long lines to silence checkpatch] Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08async: Add some documentation.Cornelia Huck
Add some kerneldoc to the async interface. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08async: Handle kthread_run() return codes.Cornelia Huck
If we fail to create the manager thread, fall back to non-fastboot. If we fail to create an async thread, try again after waiting for a bit. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08async: Fix running list handling.Cornelia Huck
async_schedule() should pass in async_running as the running list, and run_one_entry() should put the entry to be run on the provided running list instead of always on the generic one. Reported-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-07Merge branches 'release', 'asus', 'bugzilla-12450', 'cpuidle', 'debug', ↵Len Brown
'ec', 'misc', 'printk' and 'processor' into release
2009-02-06fork.c: fix NULL pointer dereference when nr_threads == threads-maxLi Zefan
I happened to forked lots of processes, and hit NULL pointer dereference. It is because in copy_process() after checking max_threads, 0 is returned but not -EAGAIN. The bug is introduced by "CRED: Detach the credentials from task_struct" (commit f1752eec6145c97163dbce62d17cf5d928e28a27). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05wait: prevent exclusive waiter starvationJohannes Weiner
With exclusive waiters, every process woken up through the wait queue must ensure that the next waiter down the line is woken when it has finished. Interruptible waiters don't do that when aborting due to a signal. And if an aborting waiter is concurrently woken up through the waitqueue, noone will ever wake up the next waiter. This has been observed with __wait_on_bit_lock() used by lock_page_killable(): the first contender on the queue was aborting when the actual lock holder woke it up concurrently. The aborted contender didn't acquire the lock and therefor never did an unlock followed by waking up the next waiter. Add abort_exclusive_wait() which removes the process' wait descriptor from the waitqueue, iff still queued, or wakes up the next waiter otherwise. It does so under the waitqueue lock. Racing with a wake up means the aborting process is either already woken (removed from the queue) and will wake up the next waiter, or it will remove itself from the queue and the concurrent wake up will apply to the next waiter after it. Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and __wait_on_bit_lock() when they were interrupted by other means than a wake up through the queue. [akpm@linux-foundation.org: coding-style fixes] Reported-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Mentored-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Chuck Lever <cel@citi.umich.edu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> ["after some testing"] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY"Andrew Morton
Revert commit 0c2d64fb6cae9aae480f6a46cfe79f8d7d48b59f because it causes (arguably poorly designed) existing userspace to spend interminable periods closing billions of not-open file descriptors. We could bring this back, with some sort of opt-in tunable in /proc, which defaults to "off". Peter's alanysis follows: : I spent several hours trying to get to the bottom of a serious : performance issue that appeared on one of our servers after upgrading to : 2.6.28. In the end it's what could be considered a userspace bug that : was triggered by a change in 2.6.28. Since this might also affect other : people I figured I'd at least document what I found here, and maybe we : can even do something about it: : : : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately : the team maintaining our ftp archive complained that one of their : scripts that previously ran in a few minutes still hadn't even come : close to being done after an hour or so. Downgrading to 2.6.27 fixed : that. : : Turns out that script is forking a lot and something in it or python or : whereever closes all the file descriptors it doesn't want to pass on. : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and : closes them all with a few exceptions. : : Turns out that takes a long time when your limit -n is now 2^20 (1048576). : : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is : now a thousand times that. : : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to : RLIM_INFINITY" (0c2d64fb6cae9aae480f6a46cfe79f8d7d48b59f)[1] that : allows, as the title implies, to set the limit for number of files to : infinity. : : Closer investigation showed that the broken default ulimit did not apply : to "system" processes (like stuff started from init). In the end I : could establish that all processes that passed through pam_limit at one : point had the bad resource limit. : : Apparently the pam library in Debian etch (4.0) initializes the limits : to some default values when it doesn't have any settings in limit.conf : to override them. Turns out that for nofiles this is RLIM_INFINITY. : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam : package version 0.79-5 fixes that - tho I'm not sure what side effects : that has. : : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it : uses a different pam (version). Reported-by: Peter Palfrader <weasel@debian.org> Cc: Adam Tkac <vonsch@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05kernel/async.c: fix printk warningsAndrew Morton
alpha: kernel/async.c: In function 'run_one_entry': kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t' kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t' kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64' kernel/async.c: In function 'async_synchronize_cookie_special': kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64' Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05timers: split process wide cpu clocks/timersPeter Zijlstra
Change the process wide cpu timers/clocks so that we: 1) don't mess up the kernel with too many threads, 2) don't have a per-cpu allocation for each process, 3) have no impact when not used. In order to accomplish this we're going to split it into two parts: - clocks; which can take all the time they want since they run from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) - timers; which need constant time sampling but since they're explicity used, the user can pay the overhead. The clock readout will go back to a full sum of the thread group, while the timers will run of a global 'clock' that only runs when needed, so only programs that make use of the facility pay the price. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05signal: re-add dead task accumulation stats.Peter Zijlstra
We're going to split the process wide cpu accounting into two parts: - clocks; which can take all the time they want since they run from user context. - timers; which need constant time tracing but can affort the overhead because they're default off -- and rare. The clock readout will go back to a full sum of the thread group, for this we need to re-add the exit stats that were removed in the initial itimer rework (f06febc9: timers: fix itimer/many thread hang). Furthermore, since that full sum can be rather slow for large thread groups and we have the complete dead task stats, revert the do_notify_parent time computation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: APIC: enable workaround on AMD Fam10h CPUs xen: disable interrupts before saving in percpu x86: add x86@kernel.org to MAINTAINERS x86: push old stack address on irqstack for unwinder irq, x86: fix lock status with numa_migrate_irq_desc x86: add cache descriptors for Intel Core i7 x86/Voyager: make it build and boot
2009-02-04sched: fix nohz load balancer on cpu offlineSuresh Siddha
Christian Borntraeger reports: > After a logical cpu offline, even on a complete idle system, there > is one cpu with full ticks. It turns out that nohz.cpu_mask has the > the offlined cpu still set. > > In select_nohz_load_balancer() we check if the system is completely > idle to turn of load balancing. We compare cpu_online_map with > nohz.cpu_mask. Since cpu_online_map is updated on cpu unplug, > but nohz.cpu_mask is not, the check fails and the scheduler believes > that we need an "idle load balancer" even on a fully idle system. > Since the ilb cpu does not deactivate the timer tick this breaks NOHZ. Fix the select_nohz_load_balancer() to not set the nohz.cpu_mask while a cpu is going offline. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04Merge branch 'core/xen' into x86/urgentIngo Molnar
2009-02-03ftrace: do_each_pid_task() needs rcu lockOleg Nesterov
"ftrace: use struct pid" commit 978f3a45d9499c7a447ca7615455cefb63d44165 converted ftrace_pid_trace to "struct pid*". But we can't use do_each_pid_task() without rcu_read_lock() even if we know the pid itself can't go away (it was pinned in ftrace_pid_write). The exiting task can detach itself from this pid at any moment. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_rt: don't use first_cpu on cpumask created with cpumask_and sched: fix buddie group latency sched: clear buddies more aggressively sched: symmetric sync vs avg_overlap sched: fix sync wakeups cpuset: fix possible deadlock in async_rebuild_sched_domains
2009-02-03modules: Use a better scheme for refcountingEric Dumazet
Current refcounting for modules (done if CONFIG_MODULE_UNLOAD=y) is using a lot of memory. Each 'struct module' contains an [NR_CPUS] array of full cache lines. This patch uses existing infrastructure (percpu_modalloc() & percpu_modfree()) to allocate percpu space for the refcount storage. Instead of wasting NR_CPUS*128 bytes (on i386), we now use nr_cpu_ids*sizeof(local_t) bytes. On a typical distro, where NR_CPUS=8, shiping 2000 modules, we reduce size of module files by about 2 Mbytes. (1Kb per module) Instead of having all refcounters in the same memory node - with TLB misses because of vmalloc() - this new implementation permits to have better NUMA properties, since each CPU will use storage on its preferred node, thanks to percpu storage. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-01irq, x86: fix lock status with numa_migrate_irq_descYinghai Lu
Eric Paris reported: > I have an hp dl785g5 which is unable to successfully run > 2.6.29-0.66.rc3.fc11.x86_64 or 2.6.29-rc2-next-20090126. During bootup > (early in userspace daemons starting) I get the below BUG, which quickly > renders the machine dead. I assume it is because sparse_irq_lock never > gets released when the BUG kills that task. Adjust lock sequence when migrating a descriptor with CONFIG_NUMA_MIGRATE_IRQ_DESC enabled. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01sched_rt: don't use first_cpu on cpumask created with cpumask_andRusty Russell
cpumask_and() only initializes nr_cpu_ids bits, so the (deprecated) first_cpu() might find one of those uninitialized bits if nr_cpu_ids is less than NR_CPUS (as it can be for CONFIG_CPUMASK_OFFSTACK). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01sched: fix buddie group latencyPeter Zijlstra
Similar to the previous patch, by not clearing buddies we can select entities past their run quota, which can increase latency. This means we have to clear group buddies as well. Do not use the group clear for pick_next_task(), otherwise that'll get O(n^2). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01sched: clear buddies more aggressivelyMike Galbraith
It was noticed that a task could get re-elected past its run quota due to buddy affinities. This could increase latency a little. Cure it by more aggresively clearing buddy state. We do so in two situations: - when we force preempt - when we select a buddy to run Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01sched: symmetric sync vs avg_overlapPeter Zijlstra
Reinstate the weakening of the sync hint if set. This yields a more symmetric usage of avg_overlap. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01sched: fix sync wakeupsPeter Zijlstra
Pawel Dziekonski reported that the openssl benchmark and his quantum chemistry application both show slowdowns due to the scheduler under-parallelizing execution. The reason are pipe wakeups still doing 'sync' wakeups which overrides the normal buddy wakeup logic - even if waker and wakee are loosely coupled. Fix an inversion of logic in the buddy wakeup code. Reported-by: Pawel Dziekonski <dzieko@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-31Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: use per cpu data for single cpu ipi calls cpumask: convert lib/smp_processor_id to new cpumask ops signals, debug: fix BUG: using smp_processor_id() in preemptible code in print_fatal_signal()
2009-01-31Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq: export __set_irq_handler() and handle_level_irq()
2009-01-31Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: prevent negative expiry value after clock_was_set() hrtimers: allow the hot-unplugging of all cpus hrtimers: increase clock min delta threshold while interrupt hanging
2009-01-31Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, ds, bts: cleanup/fix DS configuration ring-buffer: reset timestamps when ring buffer is reset trace: set max latency variable to zero on default trace: stop all recording to ring buffer on ftrace_dump trace: print ftrace_dump at KERN_EMERG log level ring_buffer: reset write when reserve buffer fail tracing/function-graph-tracer: fix a regression while suspend to disk ring-buffer: fix alignment problem
2009-01-30hrtimer: prevent negative expiry value after clock_was_set()Thomas Gleixner
Impact: prevent false positive WARN_ON() in clockevents_program_event() clock_was_set() changes the base->offset of CLOCK_REALTIME and enforces the reprogramming of the clockevent device to expire timers which are based on CLOCK_REALTIME. If the clock change is large enough then the subtraction of the timer expiry value and base->offset can become negative which triggers the warning in clockevents_program_event(). Check the subtraction result and set a negative value to 0. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-01-30hrtimers: allow the hot-unplugging of all cpusSebastien Dugue
Impact: fix CPU hotplug hang on Power6 testbox On architectures that support offlining all cpus (at least powerpc/pseries), hot-unpluging the tick_do_timer_cpu can result in a system hang. This comes from the fact that if the cpu going down happens to be the cpu doing the tick, then as the tick_do_timer_cpu handover happens after the cpu is dead (via the CPU_DEAD notification), we're left without ticks, jiffies are frozen and any task relying on timers (msleep, ...) is stuck. That's particularly the case for the cpu looping in __cpu_die() waiting for the dying cpu to be dead. This patch addresses this by having the tick_do_timer_cpu handover happen earlier during the CPU_DYING notification. For this, a new clockevent notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered in hrtimer_cpu_notify(). Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30hrtimers: increase clock min delta threshold while interrupt hangingFrederic Weisbecker
Impact: avoid timer IRQ hanging slow systems While using the function graph tracer on a virtualized system, the hrtimer_interrupt can hang the system on an infinite loop. This can be caused in several situations: - the hardware is very slow and HZ is set too high - something intrusive is slowing the system down (tracing under emulation) ... and the next clock events to program are always before the current time. This patch implements a reasonable compromise: if such a situation is detected, we share the CPUs time in 1/4 to process the hrtimer interrupts. This is enough to let the system running without serious starvation. It has been successfully tested under VirtualBox with 1000 HZ and 100 HZ with function graph tracer launched. On both cases, the clock events were increased until about 25 ms periodic ticks, which means 40 HZ. So we change a hard to debug hang into a warning message and a system that still manages to limp along. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30generic-ipi: use per cpu data for single cpu ipi callsSteven Rostedt
The smp_call_function can be passed a wait parameter telling it to wait for all the functions running on other CPUs to complete before returning, or to return without waiting. Unfortunately, this is currently just a suggestion and not manditory. That is, the smp_call_function can decide not to return and wait instead. The reason for this is because it uses kmalloc to allocate storage to send to the called CPU and that CPU will free it when it is done. But if we fail to allocate the storage, the stack is used instead. This means we must wait for the called CPU to finish before continuing. Unfortunatly, some callers do no abide by this hint and act as if the non-wait option is mandatory. The MTRR code for instance will deadlock if the smp_call_function is set to wait. This is because the smp_call_function will wait for the other CPUs to finish their called functions, but those functions are waiting on the caller to continue. This patch changes the generic smp_call_function code to use per cpu variables if the allocation of the data fails for a single CPU call. The smp_call_function_many will fall back to the smp_call_function_single if it fails its alloc. The smp_call_function_single is modified to not force the wait state. Since we now are using a single data per cpu we must synchronize the callers to prevent a second caller modifying the data before the first called IPI functions complete. To do so, I added a flag to the call_single_data called CSD_FLAG_LOCK. When the single CPU is called (which can be called when a many call fails an alloc), we set the LOCK bit on this per cpu data. When the caller finishes it clears the LOCK bit. The caller must wait till the LOCK bit is cleared before setting it. When it is cleared, there is no IPI function using it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30cgroup: fix root_count when mount fails due to busy subsystemPaul Menage
root_count was being incremented in cgroup_get_sb() after all error checking was complete, but decremented in cgroup_kill_sb(), which can be called on a superblock that we gave up on due to an error. This patch changes cgroup_kill_sb() to only decrement root_count if the root was previously linked into the list of roots. Signed-off-by: Paul Menage <menage@google.com> Tested-by: Serge Hallyn <serue@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-30cgroups: add cpu_relax() calls in css_tryget() and cgroup_clear_css_refs()Paul Menage
css_tryget() and cgroup_clear_css_refs() contain polling loops; these loops should have cpu_relax calls in them to reduce cross-cache traffic. Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>