path: root/kernel
Age | Commit message | Author
2011-10-06 | sched: Wrap scheduler p->cpus_allowed access | Peter Zijlstra
This change is preparatory for the migrate_disable() implementation, but it stands on its own and provides a cleanup. It currently converts only those sites required for task placement. Kosaki-san once mentioned replacing cpus_allowed with a proper cpumask_t instead of the NR_CPUS-sized array it currently is; that would also require something like this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
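For illustration, a minimal sketch of the kind of accessor such a conversion centres on; the helper name and placement are assumptions based on the description above, not part of this log:

    /* Sketch only: route direct p->cpus_allowed accesses through one helper,
     * so the underlying storage can later change without touching callers. */
    static inline const struct cpumask *tsk_cpus_allowed(struct task_struct *p)
    {
            return &p->cpus_allowed;
    }

    /* Task-placement sites would then test
     *   cpumask_test_cpu(cpu, tsk_cpus_allowed(p))
     * rather than dereferencing &p->cpus_allowed directly. */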
2011-10-06 | sched: Request for idle balance during nohz idle load balance | Suresh Siddha
rq's idle_at_tick is set to idle/busy during the timer tick depending on whether the cpu was idle or not. This will be used later in the load balance that will be done in the softirq context (which is a process context in -RT kernels). For nohz kernels, for the cpu doing the nohz idle load balance on behalf of all the idle cpus, its rq->idle_at_tick might have a stale value (recorded when it got the timer tick, presumably while it was busy). As the nohz idle load balancing is done at the same place as the regular load balancing, nohz idle load balancing was bailing out when it saw rq's idle_at_tick not set, leading to poor system utilization. Rename rq's idle_at_tick to idle_balance and set it when someone requests nohz idle balance on an idle cpu. Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-06 | sched: Use resched IPI to kick off the nohz idle balance | Suresh Siddha
Current use of smp call function to kick the nohz idle balance can deadlock in this scenario:

1. cpu-A did a generic_exec_single() to cpu-B and, after queuing its call single data (csd) to the call single queue, cpu-A took a timer interrupt. The actual IPI to cpu-B to process the call single queue is not yet sent.

2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1) and __smp_call_function_single() with nowait will queue the csd to cpu-B's queue. But generic_exec_single() won't send an IPI to cpu-B as the call single queue was not empty.

3. cpu-A is busy with lots of interrupts.

4. Meanwhile cpu-B is entering and exiting idle and notices that its rq->nohz_balance_kick is set to '1'. So it will go ahead, do the idle load balancing and clear its rq->nohz_balance_kick.

5. At this point, the csd queued as part of step 2 above is still locked and waiting to be serviced on cpu-B.

6. cpu-A is still busy with interrupt load and now gets another timer interrupt, and as part of it decides to kick cpu-B for another idle load balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step 4 above) and does __smp_call_function_single() with the same csd that is still locked.

7. And we get a deadlock waiting for the csd_lock() in __smp_call_function_single().

The main issue here is that cpu-B can service the idle load balancer kick request from cpu-A even without receiving the IPI, and this leads to doing multiple __smp_call_function_single() calls on the same csd, leading to deadlock.

To kick a cpu, the scheduler already has the reschedule vector reserved. Use that mechanism (kick_process()) instead of the generic smp call function mechanism to kick off the nohz idle load balancing and avoid the deadlock.

[ This issue is present from 2.6.35+ kernels, but marking it -stable only from v3.0+ as the proposed fix depends on the scheduler_ipi() that was introduced recently. ]

Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org # v3.0+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
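A hedged sketch of the direction described above, not the actual patch; the helper name kick_ilb_cpu() is hypothetical, while rq->nohz_balance_kick is the field named in the message:

    /* Sketch: set the pending-kick flag and use the already-reserved
     * reschedule vector instead of a shared call-single-data entry. */
    static void kick_ilb_cpu(int ilb_cpu)
    {
            struct rq *rq = cpu_rq(ilb_cpu);

            if (rq->nohz_balance_kick)
                    return;                 /* a kick is already pending */
            rq->nohz_balance_kick = 1;
            smp_send_reschedule(ilb_cpu);   /* scheduler_ipi() notices the flag */
    }

Because the reschedule IPI carries no payload and needs no csd, re-kicking an already-kicked cpu cannot deadlock.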
2011-10-05 | rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock() | Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-04 | genirq: Fix fatfinered fixup really | Thomas Gleixner
Putting the argument inside the quote does not really help. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-04 | sched: Fix idle_cpu() | Thomas Gleixner
On -rt we observed hackbench waking all 400 tasks to a single cpu. This is because of select_idle_sibling()'s interaction with the new ipi based wakeup scheme. The existing idle_cpu() test only checks whether the current task on that cpu is the idle task; it does not take already queued tasks into account, nor does it take queued-to-be-woken tasks into account. If the remote wakeup IPIs come hard enough, there won't be time to schedule away from the idle task, and the waking cpu would thus keep thinking the target cpu was in fact idle, regardless of the fact that there were already several hundred tasks runnable. We couldn't reproduce this on mainline, but there's no reason it couldn't happen. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
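A hedged sketch of the strengthened check, with field names taken from the description above rather than the verbatim patch:

    /* Sketch: treat a cpu as idle only if nothing runs, nothing is queued,
     * and no remote wakeups are still parked on its wake list. */
    int idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (rq->curr != rq->idle)
                    return 0;               /* something other than the idle task runs */
            if (rq->nr_running)
                    return 0;               /* runnable tasks are already queued */
    #ifdef CONFIG_SMP
            if (!llist_empty(&rq->wake_list))
                    return 0;               /* queued-to-be-woken tasks are pending */
    #endif
            return 1;
    }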
2011-10-04 | sched: Convert to struct llist | Peter Zijlstra
Use the generic llist primitives. We had a private lockless list implementation in the scheduler in the wake-list code; now that we have a generic llist implementation that provides all required operations, switch to it. This patch is not expected to change any behavior. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Huang Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | llist: Add llist_next() | Peter Zijlstra
So we don't have to expose the struct llist_node member. Cc: Huang Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315836348.26517.41.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
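The helper is presumably a trivial accessor along these lines (sketch):

    /* Sketch: return the next entry without callers touching llist_node internals. */
    static inline struct llist_node *llist_next(struct llist_node *node)
    {
            return node->next;
    }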
2011-10-04 | irq_work: Use llist in the struct irq_work logic | Huang Ying
Use llist in irq_work instead of the open-coded lock-less linked list implementation in irq_work, to avoid the code duplication. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1315461646-1379-6-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-04 | Merge branch 'linus' into sched/core | Ingo Molnar
Merge reason: pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-03 | ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal | Vasily Averin
Remove the obsolete sysctl; the ip_rt_gc_interval variable has not been used since 2.6.38. Signed-off-by: Vasily Averin <vvs@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 | genirq: percpu: allow interrupt type to be set at enable time | Marc Zyngier
As request_percpu_irq() doesn't allow for a percpu interrupt to have its type configured (it is generally impossible to configure it on all CPUs at once), add a 'type' argument to enable_percpu_irq(). This allows some low-level, board specific init code to be switched to a generic API. [ tglx: Added WARN_ON argument ] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
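A hedged usage sketch of the extended call; the handler name, dev_id variable, and trigger flag are illustrative, not taken from this log:

    /* Sketch: request the shared per-cpu IRQ once, then each CPU enables its
     * own copy, now passing the trigger type instead of configuring it
     * separately in board code. */
    err = request_percpu_irq(irq, my_timer_handler, "local-timer", my_percpu_dev);
    if (!err)
            enable_percpu_irq(irq, IRQ_TYPE_EDGE_RISING);   /* run on each CPU */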
2011-10-03 | genirq: Add support for per-cpu dev_id interrupts | Marc Zyngier
The ARM GIC interrupt controller offers per CPU interrupts (PPIs), which are usually used to connect local timers to each core. Each CPU has its own private interface to the GIC, and only sees the PPIs that are directly connected to it.

While these timers are separate devices and have a separate interrupt line to a core, they all use the same IRQ number. For these devices, request_irq() is not the right API, as it assumes that an IRQ number is visible to a number of CPUs (through the affinity setting), but makes it very awkward to express that an IRQ number can be handled by all CPUs, and yet be a different interrupt line on each CPU, requiring a different dev_id cookie to be passed back to the handler.

The *_percpu_irq() functions are designed to overcome these limitations, by providing a per-cpu dev_id vector:

int request_percpu_irq(unsigned int irq, irq_handler_t handler,
                       const char *devname, void __percpu *percpu_dev_id);
void free_percpu_irq(unsigned int, void __percpu *);
int setup_percpu_irq(unsigned int irq, struct irqaction *new);
void remove_percpu_irq(unsigned int irq, struct irqaction *act);
void enable_percpu_irq(unsigned int irq);
void disable_percpu_irq(unsigned int irq);

The API has a number of limitations:
- no interrupt sharing
- no threading
- common handler across all the CPUs

Once the interrupt is requested using setup_percpu_irq() or request_percpu_irq(), it must be enabled by each core that wishes its local interrupt to be delivered. See the usage sketch below.

Based on an initial patch by Thomas Gleixner.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1316793788-14500-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
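For illustration, a hedged sketch of how a per-cpu device such as a local timer might use this API; handler and variable names are hypothetical:

    static DEFINE_PER_CPU(struct clock_event_device, local_timer_evt);

    static irqreturn_t local_timer_handler(int irq, void *dev_id)
    {
            struct clock_event_device *evt = dev_id;   /* this CPU's private cookie */

            evt->event_handler(evt);
            return IRQ_HANDLED;
    }

    /* once, e.g. on the boot CPU: request the shared PPI number */
    err = request_percpu_irq(ppi_irq, local_timer_handler, "local-timer",
                             &local_timer_evt);

    /* on each CPU, when it brings up its own timer: */
    enable_percpu_irq(ppi_irq);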
2011-10-03 | writeback: per task dirty rate limit | Wu Fengguang
Add two fields to task_struct:

1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility

The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will scale near-sqrt to the safety gap between dirty pages and threshold.

The main problem of per-task nr_dirtied is: if 1k+ tasks start dirtying pages at exactly the same time, each task will be assigned a large initial nr_dirtied_pause, so that the dirty threshold will be exceeded long before each task reaches its nr_dirtied_pause and hence calls balance_dirty_pages().

The solution is to watch for the number of pages dirtied on each CPU in between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages (3% dirty threshold), force call balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In normal situations, this safeguarding condition is not expected to trigger at all.

On the sqrt in dirty_poll_interval(): it will serve as an initial guess when dirty pages are still in the freerun area. When dirty pages are floating inside the dirty control scope [freerun, limit], a followup patch will use some refined dirty poll interval to get the desired pause time.

    thresh-dirty (MB)    sqrt
                    1      16
                    2      22
                    4      32
                    8      45
                   16      64
                   32      90
                   64     128
                  128     181
                  256     256
                  512     362
                 1024     512

The above table means: given a 1MB (or 1GB) gap and the dd tasks polling balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't be exceeded as long as there are fewer than 16 (or 512) concurrent dd's. So sqrt naturally leads to less overhead and more safe concurrent tasks for large memory servers, which have large (thresh-freerun) gaps.

peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
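A hedged sketch of the near-sqrt pause computation described above; names approximate the description and are not the verbatim patch:

    /* Sketch: number of pages a task may dirty before it must poll
     * balance_dirty_pages() again, scaling roughly with sqrt(gap). */
    static unsigned long dirty_poll_interval(unsigned long dirty,
                                             unsigned long thresh)
    {
            if (thresh > dirty)
                    return 1UL << (ilog2(thresh - dirty) >> 1); /* ~sqrt of the gap */

            return 1;       /* at or above the threshold: re-check every page */
    }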
2011-10-01 | Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip | Linus Torvalds

* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  irq: Fix check for already initialized irq_domain in irq_domain_add
  irq: Add declaration of irq_domain_simple_ops to irqdomain.h

* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  x86/rtc: Don't recursively acquire rtc_lock

* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  posix-cpu-timers: Cure SMP wobbles
  sched: Fix up wchan borkage
  sched/rt: Migrate equal priority tasks to available CPUs
2011-10-01 | Merge branch 'rcu/next' of git://github.com/paulmckrcu/linux into core/rcu | Ingo Molnar
2011-09-30 | posix-cpu-timers: Cure SMP wobbles | Peter Zijlstra
David reported:

  Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar.

  Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group.

  I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries).

  For example:

  [davem@boricha build-x86_64-linux]$ ./test
  process: before(0.001221967) after(0.498624371) diff(497402404)
  thread: before(0.000081692) after(0.498316431) diff(498234739)
  self: before(0.001223521) after(0.001240219) diff(16698)
  [davem@boricha build-x86_64-linux]$

  The diff of 'process' should always be >= the diff of 'thread'.

  I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones.

---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
    pthread_barrier_wait(&barrier);
    while (1)
        __asm__ __volatile__("" : : : "memory");
    return NULL;
}

int main(void)
{
    clockid_t process_clock, my_thread_clock, th_clock;
    struct timespec process_before, process_after;
    struct timespec me_before, me_after;
    struct timespec th_before, th_after;
    struct timespec sleeptime;
    unsigned long diff;
    pthread_t th;
    int err;

    err = clock_getcpuclockid(0, &process_clock);
    if (err)
        return 1;
    err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    if (err)
        return 1;

    pthread_barrier_init(&barrier, NULL, 2);
    err = pthread_create(&th, NULL, chew_cpu, NULL);
    if (err)
        return 1;
    err = pthread_getcpuclockid(th, &th_clock);
    if (err)
        return 1;

    pthread_barrier_wait(&barrier);

    err = clock_gettime(process_clock, &process_before);
    if (err)
        return 1;
    err = clock_gettime(my_thread_clock, &me_before);
    if (err)
        return 1;
    err = clock_gettime(th_clock, &th_before);
    if (err)
        return 1;

    sleeptime.tv_sec = 0;
    sleeptime.tv_nsec = 500000000;
    nanosleep(&sleeptime, NULL);

    err = clock_gettime(th_clock, &th_after);
    if (err)
        return 1;
    err = clock_gettime(my_thread_clock, &me_after);
    if (err)
        return 1;
    err = clock_gettime(process_clock, &process_after);
    if (err)
        return 1;

    diff = process_after.tv_nsec - process_before.tv_nsec;
    printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           process_before.tv_sec, process_before.tv_nsec,
           process_after.tv_sec, process_after.tv_nsec, diff);
    diff = th_after.tv_nsec - th_before.tv_nsec;
    printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           th_before.tv_sec, th_before.tv_nsec,
           th_after.tv_sec, th_after.tv_nsec, diff);
    diff = me_after.tv_nsec - me_before.tv_nsec;
    printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           me_before.tv_sec, me_before.tv_nsec,
           me_after.tv_sec, me_after.tv_nsec, diff);

    return 0;
}
---

This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with thread_group_sched_runtime(), since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime(). Aside from that, it makes the function safe on 32 bit systems: the old code added t->se.sum_exec_runtime unprotected, and sum_exec_runtime is a 64bit value that could be changed on another cpu at the same time.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
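A hedged sketch of the key change in the thread-group summation, with surrounding locking omitted and names approximated from the description:

    /* Sketch: sum per-thread runtime via task_sched_runtime(), which folds in
     * time accrued since the last tick, instead of raw t->se.sum_exec_runtime. */
    t = tsk;
    do {
            times->utime = cputime_add(times->utime, t->utime);
            times->stime = cputime_add(times->stime, t->stime);
            times->sum_exec_runtime += task_sched_runtime(t);
    } while_each_thread(tsk, t);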
2011-09-30 | Resource: fix wrong resource window calculation | Ram Pai
__find_resource() incorrectly returns a resource window which overlaps an existing allocated window. This happens when the parent's resource-window spans 0x00000000 to 0xffffffff and is entirely allocated to all its children resource-windows. __find_resource() looks for gaps in resource allocation among the children resource windows. When it encounters the last child window it blindly tries the range next to one allocated to the last child. Since the last child's window ends at 0xffffffff the calculation overflows, leading the algorithm to believe that any window in the range 0x0000000 to 0xfffffff is available for allocation. This leads to a conflicting window allocation. Michal Ludvig reported this issue seen on his platform. The following patch fixes the problem and has been verified by Michal. I believe this bug has been there for ages. It got exposed by git commit 2bbc6942273b ("PCI : ability to relocate assigned pci-resources") Signed-off-by: Ram Pai <linuxram@us.ibm.com> Tested-by: Michal Ludvig <mludvig@logix.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
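For illustration, a hypothetical sketch of the wrap-around hazard described above; variable names are illustrative and this is not the actual patch:

    /* Sketch: advancing "one past the end" of the last child window can wrap.
     * If the child ends at 0xffffffff, this->end + 1 overflows to 0 and the
     * search wrongly concludes the low range is free. A guarded advance: */
    resource_size_t next = this->end + 1;
    if (next <= this->end)
            return -EBUSY;          /* wrapped: no gap beyond the last child */
    tmp.start = next;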
2011-09-29 | user namespace: usb: make usb urbs user namespace aware (v2) | Serge Hallyn
Add to the dev_state and alloc_async structures the user namespace corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(), which can then implement a proper, user-namespace-aware uid check. Changelog: Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace, uid, and euid each separately, pass a struct cred. Sep 26: Address Alan Stern's comments: don't define a struct cred at usbdev_open(), and take and put a cred at async_completed() to ensure it lasts for the duration of kill_pid_info_as_cred(). Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-09-29 | PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set | Ming Lei
Do not build kernel/trace/rpm-traces.c if CONFIG_PM_RUNTIME is not set, which avoids a build failure. [rjw: Added the changelog and modified the subject slightly.] Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-09-29 | rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp() | Paul E. McKenney
It is possible for the CPU that noted the end of the prior grace period to not need a new one, and therefore to decide to propagate ->completed throughout the rcu_node tree without starting another grace period. However, in so doing, it releases the root rcu_node structure's lock, which can allow some other CPU to start another grace period. The first CPU will be propagating ->completed in parallel with the second CPU initializing the rcu_node tree for the new grace period. In theory this is harmless, but in practice we need to keep things simple. This commit therefore moves the propagation of ->completed to rcu_report_qs_rsp(), and refrains from marking the old grace period as having been completed until it has finished doing this. This prevents anyone from starting a new grace period concurrently with marking the old grace period as having been completed. Of course, the optimization where a CPU needing a new grace period doesn't bother marking the old one completed is still in effect: In that case, the marking happens implicitly as part of initializing the new grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states | Paul E. McKenney
The purpose of rcu_needs_cpu_flush() was to iterate on pushing the current grace period in order to help the current CPU enter dyntick-idle mode. However, this can result in failures if the CPU starts entering dyntick-idle mode, but then backs out. In this case, the call to rcu_pending() from rcu_needs_cpu_flush() might end up announcing a non-existing quiescent state. This commit therefore removes rcu_needs_cpu_flush() in favor of letting the dyntick-idle machinery at the end of the softirq handler push the loop along via its call to rcu_pending(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Wire up RCU_BOOST_PRIO for rcutree | Mike Galbraith
RCU boost threads start life at RCU_BOOST_PRIO, while others remain at RCU_KTHREAD_PRIO. While here, change thread names to match other kthreads, and adjust rcu_yield() to not override the priority set by the user. This last change sets the stage for runtime changes to priority in the -rt tree. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Make rcu_torture_boost() exit loops at end of test | Paul E. McKenney
One of the loops in rcu_torture_boost() fails to check kthread_should_stop(), and thus might be slowing or even stopping completion of rcutorture tests at rmmod time. This commit adds the kthread_should_stop() check to the offending loop. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Make rcu_torture_fqs() exit loops at end of test | Paul E. McKenney
The rcu_torture_fqs() function can prevent the rcutorture tests from completing, resulting in a hang. This commit therefore ensures that rcu_torture_fqs() will exit its inner loops at the end of the test, and also applies the newish ULONG_CMP_LT() macro to time comparisons. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
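The wrap-safe comparison macro referenced above is, roughly, the following (sketch based on its usual definition, not quoted from this log):

    /* Sketch: "a < b" for free-running unsigned long counters that may wrap. */
    #define ULONG_CMP_LT(a, b)      (ULONG_MAX / 2 < (b) - (a))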
2011-09-29 | rcu: Permit rt_mutex_unlock() with irqs disabled | Paul E. McKenney
Create a separate lockdep class for the rt_mutex used for RCU priority boosting and enable use of rt_mutex_lock() with irqs disabled. This prevents RCU priority boosting from falling prey to deadlocks when someone begins an RCU read-side critical section in preemptible state, but releases it with an irq-disabled lock held. Unfortunately, the scheduler's runqueue and priority-inheritance locks still must either completely enclose or be completely enclosed by any overlapping RCU read-side critical section. This version removes a redundant local_irq_restore() noted by Yong Zhang. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Avoid having just-onlined CPU resched itself when RCU is idle | Paul E. McKenney
CPUs set rdp->qs_pending when coming online to resolve races with grace-period start. However, this means that if RCU is idle, the just-onlined CPU might needlessly send itself resched IPIs. Adjust the online-CPU initialization to avoid this, and also to correctly cause the CPU to respond to the current grace period if needed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Christian Hoffmann <email@christianhoffmann.info>
2011-09-29 | rcu: Suppress NMI backtraces when stall ends before dump | Paul E. McKenney
It is possible for an RCU CPU stall to end just as it is detected, in which case the current code will uselessly dump all CPUs' stacks. This commit therefore checks for this condition and refrains from sending needless NMIs. And yes, the stall might also end just after we checked all CPUs and tasks, but in that case we would at least have given some clue as to which CPU/task was at fault. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Prohibit grace periods during early boot | Paul E. McKenney
Greater use of RCU during early boot (before the scheduler is operating) is causing RCU to attempt to start grace periods during that time, which in turn is resulting in both RCU and the callback functions attempting to use the scheduler before it is ready. This commit prevents these problems by prohibiting RCU grace periods until after the scheduler has spawned the first non-idle task. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Simplify unboosting checks | Paul E. McKenney
Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special) introduced a new ->rcu_boosted field in the task structure. This is redundant because the existing ->rcu_boost_mutex will be non-NULL at any time that ->rcu_boosted is nonzero. Therefore, this commit removes ->rcu_boosted and tests ->rcu_boost_mutex instead. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Prevent early boot set_need_resched() from __rcu_pending() | Paul E. McKenney
There isn't a whole lot of point in poking the scheduler before there are other tasks to switch to. This commit therefore adds a check for rcu_scheduler_fully_active in __rcu_pending() to suppress any pre-scheduler calls to set_need_resched(). The downside of this approach is additional runtime overhead in a reasonably hot code path. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Dump local stack if cannot dump all CPUs' stacks | Paul E. McKenney
The trigger_all_cpu_backtrace() function is a no-op in architectures that do not define arch_trigger_all_cpu_backtrace. On such architectures, RCU CPU stall warning messages contain no stack trace information, which makes debugging quite difficult. This commit therefore substitutes dump_stack() for architectures that do not define arch_trigger_all_cpu_backtrace, so that at least the local CPU's stack is dumped as part of the RCU CPU stall warning message. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
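A hedged sketch of the fallback at the stall-warning site:

    /* Sketch: trigger_all_cpu_backtrace() is a no-op (returning false) when the
     * architecture provides no arch_trigger_all_cpu_backtrace(); in that case
     * fall back to dumping at least the local CPU's stack. */
    if (!trigger_all_cpu_backtrace())
            dump_stack();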
2011-09-29 | rcu: Move __rcu_read_unlock()'s barrier() within if-statement | Paul E. McKenney
We only need to constrain the compiler if we are actually exiting the top-level RCU read-side critical section. This commit therefore moves the first barrier() call in __rcu_read_unlock() to inside the "if" statement, thus avoiding needless register flushes for inner rcu_read_unlock() calls. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
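A hedged sketch of the idea only, not the actual __rcu_read_unlock() body:

    /* Sketch: constrain the compiler only on the outermost unlock. */
    void rcu_read_unlock_sketch(void)
    {
            struct task_struct *t = current;

            if (--t->rcu_read_lock_nesting == 0) {
                    barrier();      /* keep critical-section accesses before exit work */
                    /* ... check ->rcu_read_unlock_special, report quiescent state ... */
            }
            /* nested unlocks take the fast path with no compiler barrier */
    }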
2011-09-29 | rcu: Make rcu_implicit_dynticks_qs() locals be correct size | Paul E. McKenney
When the ->dynticks field in the rcu_dynticks structure changed to an atomic_t, its size on 64-bit systems changed from 64 bits to 32 bits. The local variables in rcu_implicit_dynticks_qs() need to change as well, hence this commit. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Eliminate in_irq() checks in rcu_enter_nohz() | Paul E. McKenney
The in_irq() check in rcu_enter_nohz() is redundant because if we really are in an interrupt, the attempt to re-enter dyntick-idle mode will invoke rcu_needs_cpu() in any case, which will force the check for RCU callbacks. So this commit removes the check along with the set_need_resched(). Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | nohz: Remove nohz_cpu_mask | Shi, Alex
RCU no longer uses this global variable, nor does anyone else. This commit therefore removes this variable. This reduces memory footprint and also removes some atomic instructions and memory barriers from the dyntick-idle path. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Allow rcutorture's stat_interval parameter to be changed at runtime | Paul E. McKenney
When rcutorture is compiled directly into the kernel (instead of separately as a module), it is necessary to specify rcutorture.stat_interval as a kernel command-line parameter; otherwise, the rcu_torture_stats kthread is never started. However, when working with the system after it has booted, it is convenient to be able to change the time between statistics printing, particularly when logged into the console. This commit therefore allows the stat_interval parameter to be changed at runtime. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
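The runtime knob presumably boils down to widening the module parameter's permissions, roughly as sketched below (0644 vs. 0444 is an assumption about the before/after values):

    /* Sketch: 0644 makes the value writable via
     * /sys/module/rcutorture/parameters/stat_interval at runtime. */
    module_param(stat_interval, int, 0644);
    MODULE_PARM_DESC(stat_interval, "Number of seconds between stats printk()s");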
2011-09-29 | rcu: Simplify quiescent-state accounting | Paul E. McKenney
There is often a delay between the time that a CPU passes through a quiescent state and the time that this quiescent state is reported to the RCU core. It is quite possible that the grace period ended before the quiescent state could be reported, for example, some other CPU might have deduced that this CPU passed through dyntick-idle mode. It is critically important that quiescent state be counted only against the grace period that was in effect at the time that the quiescent state was detected. Previously, this was handled by recording the number of the last grace period to complete when passing through a quiescent state. The RCU core then checks this number against the current value, and rejects the quiescent state if there is a mismatch. However, one additional possibility must be accounted for, namely that the quiescent state was recorded after the prior grace period completed but before the current grace period started. In this case, the RCU core must reject the quiescent state, but the recorded number will match. This is handled when the CPU becomes aware of a new grace period -- at that point, it invalidates any prior quiescent state. This works, but is a bit indirect. The new approach records the current grace period, and the RCU core checks to see (1) that this is still the current grace period and (2) that this grace period has not yet ended. This approach simplifies reasoning about correctness, and this commit changes over to this new approach. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Add grace-period, quiescent-state, and call_rcu trace events | Paul E. McKenney
Add trace events to record grace-period start and end, quiescent states, CPUs noticing grace-period start and end, grace-period initialization, call_rcu() invocation, tasks blocking in RCU read-side critical sections, tasks exiting those same critical sections, force_quiescent_state() detection of dyntick-idle and offline CPUs, CPUs entering and leaving dyntick-idle mode (except from NMIs), CPUs coming online and going offline, and CPUs being kicked for staying in dyntick-idle mode for too long (as in many weeks, even on 32-bit systems). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> rcu: Add the rcu flavor to callback trace events The earlier trace events for registering RCU callbacks and for invoking them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched). This commit adds the RCU flavor to those trace events. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Make TINY_RCU also use softirq for RCU_BOOST=n | Paul E. McKenney
This patch #ifdefs TINY_RCU kthreads out of the kernel unless RCU_BOOST=y, thus eliminating context-switch overhead if RCU priority boosting has not been configured. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Add event-trace markers to TREE_RCU kthreads | Paul E. McKenney
Add event-trace markers to TREE_RCU kthreads to allow including these kthread's CPU time in the utilization calculations. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Move RCU_BOOST declarations to allow compiler checking | Paul E. McKenney
Andi Kleen noticed that one of the RCU_BOOST data declarations was out of sync with the definition. Move the declarations so that the compiler can do the checking in the future. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Update comments to reflect softirqs vs. kthreads | Paul E. McKenney
We now have kthreads only for flavors of RCU that support boosting, so update the now-misleading comments accordingly. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Add RCU type to callback-invocation tracing | Paul E. McKenney
Add a string to the rcu_batch_start() and rcu_batch_end() trace messages that indicates the RCU type ("rcu_sched", "rcu_bh", or "rcu_preempt"). The trace messages for the actual invocations themselves are not marked, as it should be clear from the rcu_batch_start() and rcu_batch_end() events before and after. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Put names into TINY_RCU structures under RCU_TRACE | Paul E. McKenney
In order to allow event tracing to distinguish between flavors of RCU, we need those names in the relevant RCU data structures. TINY_RCU has avoided them for memory-footprint reasons, so add them only if CONFIG_RCU_TRACE=y. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Event-trace markers for computing RCU CPU utilization | Paul E. McKenney
This commit adds the trace_rcu_utilization() marker that is to be used to allow postprocessing scripts to compute RCU's CPU utilization, give or take event-trace overhead. Note that we do not include RCU's dyntick-idle interface because event tracing requires RCU protection, which is not available in dyntick-idle mode. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
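A hedged usage sketch of the marker; the strings shown are illustrative:

    /* Sketch: bracket an RCU work region so postprocessing scripts can
     * attribute the CPU time spent in between to RCU. */
    trace_rcu_utilization("Start RCU core");
    /* ... RCU core processing ... */
    trace_rcu_utilization("End RCU core");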
2011-09-29 | rcu: Add event-tracing for RCU callback invocation | Paul E. McKenney
There was recently some controversy about the overhead of invoking RCU callbacks. Add TRACE_EVENT()s to obtain fine-grained timings for the start and stop of a batch of callbacks and also for each callback invoked. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Don't destroy rcu_torture_boost() callback until it is done | Paul E. McKenney
The rcu_torture_boost() cleanup code destroyed debug-objects state before waiting for the last RCU callback to be invoked, resulting in rare but very real debug-objects warnings. Move the destruction to after the waiting to fix this problem. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Catch rcutorture up to new RCU API additions | Paul E. McKenney
Now that the RCU API contains synchronize_rcu_bh(), synchronize_sched(), call_rcu_sched(), and rcu_bh_expedited()... Make rcutorture test synchronize_rcu_bh(), getting rid of the old rcu_bh_torture_synchronize() workaround. Similarly, make rcutorture test synchronize_sched(), getting rid of the old sched_torture_synchronize() workaround. Make rcutorture test call_rcu_sched() instead of wrappering synchronize_sched(). Also add testing of rcu_bh_expedited(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-09-29 | rcu: Abstract common code for RCU grace-period-wait primitives | Paul E. McKenney
Pull the code that waits for an RCU grace period into a single function, which is then called by synchronize_rcu() and friends in the case of TREE_RCU and TREE_PREEMPT_RCU, and from rcu_barrier() and friends in the case of TINY_RCU and TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>