summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-04-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Three fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: numa: disable change protection for vma(VM_HUGETLB) include/linux/dmapool.h: declare struct device mm: move zone lock to a different cache line than order-0 free page lists
2015-04-08Copy the kernel module data from user space in chunksLinus Torvalds
Unlike most (all?) other copies from user space, kernel module loading is almost unlimited in size. So we do a potentially huge "copy_from_user()" when we copy the module data from user space to the kernel buffer, which can be a latency concern when preemption is disabled (or voluntary). Also, because 'copy_from_user()' clears the tail of the kernel buffer on failures, even a *failed* copy can end up wasting a lot of time. Normally neither of these are concerns in real life, but they do trigger when doing stress-testing with trinity. Running in a VM seems to add its own overheadm causing trinity module load testing to even trigger the watchdog. The simple fix is to just chunk up the module loading, so that it never tries to copy insanely big areas in one go. That bounds the latency, and also the amount of (unnecessarily, in this case) cleared memory for the failure case. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-08genirq: Allow the irqchip state of an IRQ to be save/restoredMarc Zyngier
There is a number of cases where a kernel subsystem may want to introspect the state of an interrupt at the irqchip level: - When a peripheral is shared between virtual machines, its interrupt state becomes part of the guest's state, and must be switched accordingly. KVM on arm/arm64 requires this for its guest-visible timer - Some GPIO controllers seem to require peeking into the interrupt controller they are connected to to report their internal state This seem to be a pattern that is common enough for the core code to try and support this without too many horrible hacks. Introduce a pair of accessors (irq_get_irqchip_state/irq_set_irqchip_state) to retrieve the bits that can be of interest to another subsystem: pending, active, and masked. - irq_get_irqchip_state returns the state of the interrupt according to a parameter set to IRQCHIP_STATE_PENDING, IRQCHIP_STATE_ACTIVE, IRQCHIP_STATE_MASKED or IRQCHIP_STATE_LINE_LEVEL. - irq_set_irqchip_state similarly sets the state of the interrupt. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Tested-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Phong Vo <pvo@apm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Tin Huynh <tnhuynh@apm.com> Cc: Y Vo <yvo@apm.com> Cc: Toan Le <toanle@apm.com> Cc: Bjorn Andersson <bjorn@kryo.se> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/1426676484-21812-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-08genirq: MSI: Fix freeing of unallocated MSIMarc Zyngier
While debugging an unrelated issue with the GICv3 ITS driver, the following trace triggered: WARNING: CPU: 1 PID: 1 at kernel/irq/irqdomain.c:1121 irq_domain_free_irqs+0x160/0x17c() NULL pointer, cannot free irq Modules linked in: CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W 3.19.0-rc6+ #3690 Hardware name: FVP Base (DT) Call trace: [<ffffffc000089398>] dump_backtrace+0x0/0x13c [<ffffffc0000894e4>] show_stack+0x10/0x1c [<ffffffc00066d134>] dump_stack+0x74/0x94 [<ffffffc0000a92f8>] warn_slowpath_common+0x9c/0xd4 [<ffffffc0000a938c>] warn_slowpath_fmt+0x5c/0x80 [<ffffffc0000ee04c>] irq_domain_free_irqs+0x15c/0x17c [<ffffffc0000ef918>] msi_domain_free_irqs+0x58/0x74 [<ffffffc000386f58>] free_msi_irqs+0xb4/0x1c0 // The msi_prepare callback fails here [<ffffffc0003872c0>] pci_enable_msix+0x25c/0x3d4 [<ffffffc00038746c>] pci_enable_msix_range+0x34/0x80 [<ffffffc0003924ac>] vp_try_to_find_vqs+0xec/0x528 [<ffffffc000392954>] vp_find_vqs+0x6c/0xa8 [<ffffffc0003ee2a8>] init_vq+0x120/0x248 [<ffffffc0003eefb0>] virtblk_probe+0xb0/0x6bc [<ffffffc00038fc34>] virtio_dev_probe+0x17c/0x214 [<ffffffc0003d4a04>] driver_probe_device+0x7c/0x23c [<ffffffc0003d4cb0>] __driver_attach+0x98/0xa0 [<ffffffc0003d2c60>] bus_for_each_dev+0x60/0xb4 [<ffffffc0003d455c>] driver_attach+0x1c/0x28 [<ffffffc0003d41b0>] bus_add_driver+0x150/0x208 [<ffffffc0003d54c0>] driver_register+0x64/0x130 [<ffffffc00038f9e8>] register_virtio_driver+0x24/0x68 [<ffffffc00091320c>] init+0x70/0xac [<ffffffc0000828f0>] do_one_initcall+0x94/0x1d0 [<ffffffc0008e9b00>] kernel_init_freeable+0x144/0x1e4 [<ffffffc00066a434>] kernel_init+0xc/0xd8 ---[ end trace f9ee562a77cc7bae ]--- The ITS msi_prepare callback having failed, we end-up trying to free MSIs that have never been allocated. Oddly enough, the kernel is pretty upset about it. It turns out that this behaviour was expected before the MSI domain was introduced (and dealt with in arch_teardown_msi_irqs). The obvious fix is to detect this early enough and bail out. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com> Link: http://lkml.kernel.org/r/1422299419-6051-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-08Merge branch 'linus' into irq/core to get the GIC updates whichThomas Gleixner
conflict with pending GIC changes. Conflicts: drivers/usb/isp1760/isp1760-core.c
2015-04-08tracing: Add enum_map file to show enums that have been mappedSteven Rostedt (Red Hat)
Add a enum_map file in the tracing directory to see what enums have been saved to convert in the print fmt files. As this requires the enum mapping to be persistent in memory, it is only created if the new config option CONFIG_TRACE_ENUM_MAP_FILE is enabled. This is for debugging and will increase the persistent memory footprint of the kernel. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08tracing: Allow for modules to convert their enums to valuesSteven Rostedt (Red Hat)
Update the infrastructure such that modules that declare TRACE_DEFINE_ENUM() will have those enums converted into their values in the tracepoint print fmt strings. Link: http://lkml.kernel.org/r/87vbhjp74q.fsf@rustcorp.com.au Acked-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-08tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their valuesSteven Rostedt (Red Hat)
Several tracepoints use the helper functions __print_symbolic() or __print_flags() and pass in enums that do the mapping between the binary data stored and the value to print. This works well for reading the ASCII trace files, but when the data is read via userspace tools such as perf and trace-cmd, the conversion of the binary value to a human string format is lost if an enum is used, as userspace does not have access to what the ENUM is. For example, the tracepoint trace_tlb_flush() has: __print_symbolic(REC->reason, { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" }, { TLB_REMOTE_SHOOTDOWN, "remote shootdown" }, { TLB_LOCAL_SHOOTDOWN, "local shootdown" }, { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" }) Which maps the enum values to the strings they represent. But perf and trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would not be able to map it. With TRACE_DEFINE_ENUM(), developers can place these in the event header files and ftrace will convert the enums to their values: By adding: TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH); TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN); TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN); TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN); $ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format [...] __print_symbolic(REC->reason, { 0, "flush on task switch" }, { 1, "remote shootdown" }, { 2, "local shootdown" }, { 3, "local mm shootdown" }) The above is what userspace expects to see, and tools do not need to be modified to parse them. Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org Cc: Guilherme Cox <cox@computer.org> Cc: Tony Luck <tony.luck@gmail.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-07mm: numa: disable change protection for vma(VM_HUGETLB)Naoya Horiguchi
Currently when a process accesses a hugetlb range protected with PROTNONE, unexpected COWs are triggered, which finally puts the hugetlb subsystem into a broken/uncontrollable state, where for example h->resv_huge_pages is subtracted too much and wraps around to a very large number, and the free hugepage pool is no longer maintainable. This patch simply stops changing protection for vma(VM_HUGETLB) to fix the problem. And this also allows us to avoid useless overhead of minor faults. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-06Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"Rafael J. Wysocki
Commit 84c91b7ae07c (PM / hibernate: avoid unsafe pages in e820 reserved regions) is reported to make resume from hibernation on Lenovo x230 unreliable, so revert it. We will revisit the issue the commit in question was supposed to fix in the future. Link: https://bugzilla.kernel.org/show_bug.cgi?id=96111 Reported-by: rhn <kebuac.rhn@porcupinefactory.org> Cc: 3.17+ <stable@vger.kernel.org> # 3.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-06workqueue: Reorder sysfs codeFrederic Weisbecker
The sysfs code usually belongs to the botom of the file since it deals with high level objects. In the workqueue code it's misplaced and such that we'll need to work around functions references to allow the sysfs code to call APIs like apply_workqueue_attrs(). Lets move that block further in the file, almost the botom. And declare workqueue_sysfs_unregister() just before destroy_workqueue() which reference it. tj: Moved workqueue_sysfs_unregister() forward declaration where other forward declarations are. Suggested-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Mike Galbraith <bitbucket@online.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-04-03timers/PM: Drop unnecessary braces from tick_freeze()Rafael J. Wysocki
Some braces in tick_freeze() are not necessary, so drop them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: peterz@infradead.org Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1534128.H5hN3KBFB4@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03timers/PM: Fix up tick_unfreeze()Rafael J. Wysocki
A recent conflict resolution has left tick_resume() in tick_unfreeze() which leads to an unbalanced execution of tick_resume_broadcast() every time that function runs. Fix that by replacing the tick_resume() in tick_unfreeze() with tick_resume_local() as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: boris.ostrovsky@oracle.com Cc: david.vrabel@citrix.com Cc: konrad.wilk@oracle.com Cc: peterz@infradead.org Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/8099075.V0LvN3pQAV@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03sched/core: Drop debugging leftover trace_printk callBorislav Petkov
Commit: 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()") forgot a trace_printk() debugging piece in and Steve's banner screamed in dmesg. Remove it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1428050570-21041-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03timekeeping: Get rid of stale commentThomas Gleixner
Arch specific management of xtime/jiffies/wall_to_monotonic is gone for quite a while. Zap the stale comment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/2422730.dmO29q661S@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Cleanup dead cpu explicitelyThomas Gleixner
clockevents_notify() is a leftover from the early design of the clockevents facility. It's really not a notification mechanism, it's a multiplex call. We are way better off to have explicit calls instead of this monstrosity. Split out the cleanup function for a dead cpu and invoke it directly from the cpu down code. Make it conditional on CPU_HOTPLUG as well. Temporary change, will be refined in the future. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebased, added clockevents_notify() removal ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1735025.raBZdQHM3m@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Make tick handover explicitThomas Gleixner
clockevents_notify() is a leftover from the early design of the clockevents facility. It's really not a notification mechanism, it's a multiplex call. We are way better off to have explicit calls instead of this monstrosity. Split out the tick_handover call and invoke it explicitely from the hotplug code. Temporary solution will be cleaned up in later patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1658173.RkEEILFiQZ@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Remove broadcast oneshot control leftoversRafael J. Wysocki
Now that all users are converted over to explicit calls into the clockevents state machine, remove the notification chain leftovers. Original-from: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/14018863.NQUzkFuafr@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03sched/idle: Use explicit broadcast oneshot control functionThomas Gleixner
Replace the clockevents_notify() call with an explicit function call. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/6422336.RMm7oUHcXh@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Provide explicit broadcast oneshot control functionsThomas Gleixner
clockevents_notify() is a leftover from the early design of the clockevents facility. It's really not a notification mechanism, it's a multiplex call. We are way better off to have explicit calls instead of this monstrosity. Split out the broadcast oneshot control into a separate function and provide inline helpers. Switch clockevents_notify() over. This will go away once all callers are converted. This also gets rid of the nested locking of clockevents_lock and broadcast_lock. The broadcast oneshot control functions do not require clockevents_lock. Only the managing functions (setup/shutdown/suspend/resume of the broadcast device require clockevents_lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Len Brown <lenb@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Link: http://lkml.kernel.org/r/13000649.8qZuEDV0OA@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Remove the broadcast control leftoversThomas Gleixner
All users converted. Remove the notify leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2076318.76XJZ8QYP3@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clockevents: Provide explicit broadcast control functionsThomas Gleixner
clockevents_notify() is a leftover from the early design of the clockevents facility. It's really not a notification mechanism, it's a multiplex call. We are way better off to have explicit calls instead of this monstrosity. Split out the broadcast control into a separate function and provide inline helpers. Switch clockevents_notify() over. This will go away once all callers are converted. This also gets rid of the nested locking of clockevents_lock and broadcast_lock. The broadcast control functions do not require clockevents_lock. Only the managing functions (setup/shutdown/suspend/resume of the broadcast device require clockevents_lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Len Brown <lenb@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Lindgren <tony@atomide.com> Link: http://lkml.kernel.org/r/8086559.ttsuS0n1Xr@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03clocksource: Improve comment explaining clocks_calc_max_nsecs()'s 50% safety ↵John Stultz
margin Ingo noted that the description of clocks_calc_max_nsecs()'s 50% safety margin was somewhat circular. So this patch tries to improve the comment to better explain what we mean by the 50% safety margin and why we need it. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-20-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time, drivers/rtc: Don't bother with rtc_resume() for the nonstop clocksourceXunlei Pang
If a system does not provide a persistent_clock(), the time will be updated on resume by rtc_resume(). With the addition of the non-stop clocksources for suspend timing, those systems set the time on resume in timekeeping_resume(), but may not provide a valid persistent_clock(). This results in the rtc_resume() logic thinking no one has set the time and it then will over-write the suspend time again, which is not necessary and only increases clock error. So, fix this for rtc_resume(). This patch also improves the name of persistent_clock_exist to make it more grammatical. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-19-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time: Fix a bug in timekeeping_suspend() with no persistent clockXunlei Pang
When there's no persistent clock, normally timekeeping_suspend_time should always be zero, but this can break in timekeeping_suspend(). At T1, there was a system suspend, so old_delta was assigned T1. After some time, one time adjustment happened, and xtime got the value of T1-dt(0s<dt<2s). Then, there comes another system suspend soon after this adjustment, obviously we will get a small negative delta_delta, resulting in a negative timekeeping_suspend_time. This is problematic, when doing timekeeping_resume() if there is no nonstop clocksource for example, it will hit the else leg and inject the improper sleeptime which is the wrong logic. So, we can solve this problem by only doing delta related code when the persistent clock is existent. Actually the code only makes sense for persistent clock cases. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-18-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time: Don't build timekeeping_inject_sleeptime64() if no one uses itXunlei Pang
timekeeping_inject_sleeptime64() is only used by RTC suspend/resume, so add build dependencies on the necessary RTC related macros. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> [ Improve commit message clarity. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-16-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time: Add y2038 safe update_persistent_clock64()Xunlei Pang
As part of addressing in-kernel y2038 issues, this patch adds update_persistent_clock64() and replaces all the call sites of update_persistent_clock() with this function. This is a __weak implementation, which simply calls the existing y2038 unsafe update_persistent_clock(). This allows architecture specific implementations to be converted independently, and eventually y2038-unsafe update_persistent_clock() can be removed after all its architecture specific implementations have been converted to update_persistent_clock64(). Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-4-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time: Add y2038 safe read_persistent_clock64()Xunlei Pang
As part of addressing in-kernel y2038 issues, this patch adds read_persistent_clock64() and replaces all the call sites of read_persistent_clock() with this function. This is a __weak implementation, which simply calls the existing y2038 unsafe read_persistent_clock(). This allows architecture specific implementations to be converted independently, and eventually the y2038 unsafe read_persistent_clock() can be removed after all its architecture specific implementations have been converted to read_persistent_clock64(). Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-03time: Add y2038 safe read_boot_clock64()Xunlei Pang
As part of addressing in-kernel y2038 issues, this patch adds read_boot_clock64() and replaces all the call sites of read_boot_clock() with this function. This is a __weak implementation, which simply calls the existing y2038 unsafe read_boot_clock(). This allows architecture specific implementations to be converted independently, and eventually the y2038 unsafe read_boot_clock() can be removed after all its architecture specific implementations have been converted to read_boot_clock64(). Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1427945681-29972-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/asix_common.c drivers/net/usb/sr9800.c drivers/net/usb/usbnet.c include/linux/usb/usbnet.h net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c The TCP conflicts were overlapping changes. In 'net' we added a READ_ONCE() to the socket cached RX route read, whilst in 'net-next' Eric Dumazet touched the surrounding code dealing with how mini sockets are handled. With USB, it's a case of the same bug fix first going into net-next and then I cherry picked it back into net. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02ftrace/x86: Let dynamic trampolines call ops->func even for dynamic fopsSteven Rostedt (Red Hat)
Dynamically allocated trampolines call ftrace_ops_get_func to get the function which they should call. For dynamic fops (FTRACE_OPS_FL_DYNAMIC flag is set) ftrace_ops_list_func is always returned. This is reasonable for static trampolines but goes against the main advantage of dynamic ones, that is avoidance of going through the list of all registered callbacks for functions that are only being traced by a single callback. We can fix it by returning ops->func (or recursion safe version) from ftrace_ops_get_func whenever it is possible for dynamic trampolines. Note that dynamic trampolines are not allowed for dynamic fops if CONFIG_PREEMPT=y. Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1501291023000.25445@pobox.suse.cz Link: http://lkml.kernel.org/r/1424357773-13536-1-git-send-email-mbenes@suse.cz Reported-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-04-02timer: Further simplify the SMP and HOTPLUG logicPeter Zijlstra
Remove one CONFIG_HOTPLUG_CPU #ifdef in trade for introducing one CONFIG_SMP #ifdef. The CONFIG_SMP ifdef avoids declaring the per-CPU __tvec_bases storage on UP systems since they already have boot_tvec_bases. Also (re)add a runtime check on the base alignment -- for the paranoid amongst us :-) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/fdd2d35e169bdc554ffa3fe77f77716298c75ada.1427814611.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02timer: Don't initialize 'tvec_base' on hotplugViresh Kumar
There is no need to call init_timers_cpu() on every CPU hotplug event, there is not much we need to reset. - Timer-lists are already empty at the end of migrate_timers(). - timer_jiffies will be refreshed while adding a new timer, after the CPU is online again. - active_timers and all_timers can be reset from migrate_timers(). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/54a1c30ea7b805af55beb220cadf5a07a21b0a4d.1427814611.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02timer: Allocate per-cpu tvec_base's staticallyPeter Zijlstra
Memory for the 'tvec_base' array is allocated separately for the boot CPU (statically) and non-boot CPUs (dynamically). The reason is because __TIMER_INITIALIZER() needs to set ->base to a valid pointer (because we've made NULL special, hint: lock_timer_base()) and we cannot get a compile time pointer to per-cpu entries because we don't know where we'll map the section, even for the boot cpu. This can be simplified a bit by statically allocating per-cpu memory. The only disadvantage is that memory for one of the structures will stay unused, i.e. for the boot CPU, which uses boot_tvec_bases. This will also guarantee that tvec_base is cacheline aligned. Even though tvec_base has ____cacheline_aligned stuck on, kzalloc_node() does not actually respect that (but guarantees a minimum u64 alignment). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/17cdf560f2727f687ab159707d0aa591f8a2f82d.1427814611.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/deadline: Support DL task migration during CPU hotplugWanpeng Li
I observed that DL tasks can't be migrated to other CPUs during CPU hotplug, in addition, task may/may not be running again if CPU is added back. The root cause which I found is that DL tasks will be throtted and removed from the DL rq after comsuming all their budget, which leads to the situation that stop task can't pick them up from the DL rq and migrate them to other CPUs during hotplug. The method to reproduce: schedtool -E -t 50000:100000 -e ./test Actually './test' is just a simple for loop. Then observe which CPU the test task is on and offline it: echo 0 > /sys/devices/system/cpu/cpuN/online This patch adds the DL task migration during CPU hotplug by finding a most suitable later deadline rq after DL timer fires if current rq is offline. If it fails to find a suitable later deadline rq then it falls back to any eligible online CPU in so that the deadline task will come back to us, and the push/pull mechanism should then move it around properly. Suggested-and-Acked-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()Juri Lelli
Hotplug operations are destructive w.r.t. cpusets. In case such an operation is performed on a CPU belonging to an exlusive cpuset, the DL bandwidth information associated with the corresponding root domain is gone even if the operation fails (in sched_cpu_inactive()). For this reason we need to move the check we currently have in sched_cpu_inactive() to cpuset_cpu_inactive() to prevent useless cpusets reconfiguration in the CPU_DOWN_FAILED path. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Link: http://lkml.kernel.org/r/1427792017-7356-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/deadline: Always enqueue on previous rq when dl_task_timer() firesJuri Lelli
dl_task_timer() may fire on a different rq from where a task was removed after throttling. Since the call path is: dl_task_timer() -> enqueue_task_dl() -> enqueue_dl_entity() -> replenish_dl_entity() and replenish_dl_entity() uses dl_se's rq, we can't use current's rq in dl_task_timer(), but we need to lock the task's previous one. Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Juri Lelli <juri.lelli@gmail.com> Fixes: 3960c8c0c789 ("sched: Make dl_task_time() use task_rq_lock()") Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02sched/core: Remove unused argument from init_[rt|dl]_rq()Abel Vesa
Obviously, 'rq' is not used in these two functions, therefore, there is no reason for it to be passed as an argument. Signed-off-by: Abel Vesa <abelvesa@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02watchdog: Add watchdog enable/disable all functionsStephane Eranian
This patch adds two new functions to enable/disable the watchdog across all CPUs. This will be used by the HT PMU bug workaround code to disable/enable the NMI watchdog across quirk enablement. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1416251225-17721-12-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02Merge branch 'perf/urgent' into perf/core, before applying dependent patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add ITRACE_START record to indicate that tracing has startedAlexander Shishkin
For counters that generate AUX data that is bound to the context of a running task, such as instruction tracing, the decoder needs to know exactly which task is running when the event is first scheduled in, before the first sched_switch. The decoder's need to know this stems from the fact that instruction flow trace decoding will almost always require program's object code in order to reconstruct said flow and for that we need at least its pid/tid in the perf stream. To single out such instruction tracing pmus, this patch introduces ITRACE PMU capability. The reason this is not part of RECORD_AUX record is that not all pmus capable of generating AUX data need this, and the opposite is *probably* also true. While sched_switch covers for most cases, there are two problems with it: the consumer will need to process events out of order (that is, having found RECORD_AUX, it will have to skip forward to the nearest sched_switch to figure out which task it was, then go back to the actual trace to decode it) and it completely misses the case when the tracing is enabled and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add wakeup watermark control to the AUX areaAlexander Shishkin
When AUX area gets a certain amount of new data, we want to wake up userspace to collect it. This adds a new control to specify how much data will cause a wakeup. This is then passed down to pmu drivers via output handle's "wakeup" field, so that the driver can find the nearest point where it can generate an interrupt. We repurpose __reserved_2 in the event attribute for this, even though it was never checked to be zero before, aux_watermark will only matter for new AUX-aware code, so the old code should still be fine. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Support overwrite mode for the AUX areaAlexander Shishkin
This adds support for overwrite mode in the AUX area, which means "keep collecting data till you're stopped", turning AUX area into a circular buffer, where new data overwrites old data. It does not depend on data buffer's overwrite mode, so that it doesn't lose sideband data that is instrumental for processing AUX data. Overwrite mode is enabled at mapping AUX area read only. Even though aux_tail in the buffer's user page might be user writable, it will be ignored in this mode. A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf data stream every time an event writes new data to the AUX area. The pmu driver might not be able to infer the exact beginning of the new data in each snapshot, some drivers will only provide the tail, which is aux_offset + aux_size in the AUX record. Consumer has to be able to tell the new data from the old one, for example, by means of time stamps if such are provided in the trace. Consumer is also responsible for disabling any events that might write to the AUX area (thus potentially racing with the consumer) before collecting the data. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add API for PMUs to write to the AUX areaAlexander Shishkin
For pmus that wish to write data to ring buffer's AUX area, provide perf_aux_output_{begin,end}() calls to initiate/commit data writes, similarly to perf_output_{begin,end}. These also use the same output handle structure. Also, similarly to software counterparts, these will direct inherited events' output to parents' ring buffers. After the perf_aux_output_begin() returns successfully, handle->size is set to the maximum amount of data that can be written wrt aux_tail pointer, so that no data that the user hasn't seen will be overwritten, therefore this should always be called before hardware writing is enabled. On success, this will return the pointer to pmu driver's private structure allocated for this aux area by pmu::setup_aux. Same pointer can also be retrieved using perf_get_aux() while hardware writing is enabled. PMU driver should pass the actual amount of data written as a parameter to perf_aux_output_end(). All hardware writes should be completed and visible before this one is called. Additionally, perf_aux_output_skip() will adjust output handle and aux_head in case some part of the buffer has to be skipped over to maintain hardware's alignment constraints. Nested writers are forbidden and guards are in place to catch such attempts. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add AUX recordAlexander Shishkin
When there's new data in the AUX space, output a record indicating its offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to mean the described data was truncated to fit in the ring buffer. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a pmu capability for "exclusive" eventsAlexander Shishkin
Usually, pmus that do, for example, instruction tracing, would only ever be able to have one event per task per cpu (or per perf_event_context). For such pmus it makes sense to disallow creating conflicting events early on, so as to provide consistent behavior for the user. This patch adds a pmu capability that indicates such constraint on event creation. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add a capability for AUX_NO_SG pmus to do software double bufferingAlexander Shishkin
For pmus that don't support scatter-gather for AUX data in hardware, it might still make sense to implement software double buffering to avoid losing data while the user is reading data out. For this purpose, add a pmu capability that guarantees multiple high-order chunks for AUX buffer, so that the pmu driver can do switchover tricks. To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your pmu's capability mask. This will make the ring buffer AUX allocation code ensure that the biggest high order allocation for the aux buffer pages is no bigger than half of the total requested buffer size, thus making sure that the buffer has at least two high order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Support high-order allocations for AUX spaceAlexander Shishkin
Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability) don't support scatter-gather and will prefer larger contiguous areas for their output regions. This patch adds a new pmu capability to request higher order allocations. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add AUX area to ring buffer for raw data streamsPeter Zijlstra
This patch introduces "AUX space" in the perf mmap buffer, intended for exporting high bandwidth data streams to userspace, such as instruction flow traces. AUX space is a ring buffer, defined by aux_{offset,size} fields in the user_page structure, and read/write pointers aux_{head,tail}, which abide by the same rules as data_* counterparts of the main perf buffer. In order to allocate/mmap AUX, userspace needs to set up aux_offset to such an offset that will be greater than data_offset+data_size and aux_size to be the desired buffer size. Both need to be page aligned. Then, same aux_offset and aux_size should be passed to mmap() call and if everything adds up, you should have an AUX buffer as a result. Pages that are mapped into this buffer also come out of user's mlock rlimit plus perf_event_mlock_kb allowance. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02perf: Add data_{offset,size} to user_pageAlexander Shishkin
Currently, the actual perf ring buffer is one page into the mmap area, following the user page and the userspace follows this convention. This patch adds data_{offset,size} fields to user_page that can be used by userspace instead for locating perf data in the mmap area. This is also helpful when mapping existing or shared buffers if their size is not known in advance. Right now, it is made to follow the existing convention that data_offset == PAGE_SIZE and data_offset + data_size == mmap_size. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>