summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-09-25irqdomain: irq_domain_check_msi_remapEric Auger
This new function checks whether all MSI irq domains implement IRQ remapping. This is useful to understand whether VFIO passthrough is safe with respect to interrupts. On ARM typically an MSI controller can sit downstream to the IOMMU without preventing VFIO passthrough. As such any assigned device can write into the MSI doorbell. In case the MSI controller implements IRQ remapping, assigned devices will not be able to trigger interrupts towards the host. On the contrary, the assignment must be emphasized as unsafe with respect to interrupts. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-09-25genirq/msi: Set IRQ_DOMAIN_FLAG_MSI on MSI domain creationEric Auger
Now we have a flag value indicating an IRQ domain implements MSI, let's set it on msi_create_irq_domain(). Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-09-25irqdomain: Add irq domain MSI and MSI_REMAP flagsEric Auger
We introduce two new enum values for the irq domain flag: - IRQ_DOMAIN_FLAG_MSI indicates the irq domain corresponds to an MSI domain - IRQ_DOMAIN_FLAG_MSI_REMAP indicates the irq domain has MSI remapping capabilities. Those values will be useful to check all MSI irq domains have MSI remapping support when assessing the safety of IRQ assignment to a guest. irq_domain_hierarchical_is_msi_remap() allows to check if an irq domain or any parent implements MSI remapping. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-06-29time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accountingJohn Stultz
commit 3d88d56c5873f6eebe23e05c3da701960146b801 upstream. Due to how the MONOTONIC_RAW accumulation logic was handled, there is the potential for a 1ns discontinuity when we do accumulations. This small discontinuity has for the most part gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen failures with the inconsistency-check test in kselftest. This patch addresses the issue by using the same sub-ns accumulation handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel users. Since the ARM64 vDSO implementation has its own clock_gettime calculation logic, this patch reduces the frequency of errors, but failures are still seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond xtime_nsec values in its calculation for this issue to be completely fixed. Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Daniel Mentz <danielmentz@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29time: Fix clock->read(clock) race around clocksource changesJohn Stultz
commit ceea5e3771ed2378668455fa21861bead7504df5 upstream. In tests, which excercise switching of clocksources, a NULL pointer dereference can be observed on AMR64 platforms in the clocksource read() function: u64 clocksource_mmio_readl_down(struct clocksource *c) { return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask; } This is called from the core timekeeping code via: cycle_now = tkr->read(tkr->clock); tkr->read is the cached tkr->clock->read() function pointer. When the clocksource is changed then tkr->clock and tkr->read are updated sequentially. The code above results in a sequential load operation of tkr->read and tkr->clock as well. If the store to tkr->clock hits between the loads of tkr->read and tkr->clock, then the old read() function is called with the new clock pointer. As a consequence the read() function dereferences a different data structure and the resulting 'reg' pointer can point anywhere including NULL. This problem was introduced when the timekeeping code was switched over to use struct tk_read_base. Before that, it was theoretically possible as well when the compiler decided to reload clock in the code sequence: now = tk->clock->read(tk->clock); Add a helper function which avoids the issue by reading tk_read_base->clock once into a local variable clk and then issue the read function via clk->read(clk). This guarantees that the read() function always gets the proper clocksource pointer handed in. Since there is now no use for the tkr.read pointer, this patch also removes it, and to address stopping the fast timekeeper during suspend/resume, it introduces a dummy clocksource to use rather then just a dummy read function. Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29signal: Only reschedule timers on signals timers have sentEric W. Biederman
commit 57db7e4a2d92c2d3dfbca4ef8057849b2682436b upstream. Thomas Gleixner wrote: > The CRIU support added a 'feature' which allows a user space task to send > arbitrary (kernel) signals to itself. The changelog says: > > The kernel prevents sending of siginfo with positive si_code, because > these codes are reserved for kernel. I think we can allow a task to > send such a siginfo to itself. This operation should not be dangerous. > > Quite contrary to that claim, it turns out that it is outright dangerous > for signals with info->si_code == SI_TIMER. The following code sequence in > a user space task allows to crash the kernel: > > id = timer_create(CLOCK_XXX, ..... signo = SIGX); > timer_set(id, ....); > info->si_signo = SIGX; > info->si_code = SI_TIMER: > info->_sifields._timer._tid = id; > info->_sifields._timer._sys_private = 2; > rt_[tg]sigqueueinfo(..., SIGX, info); > sigemptyset(&sigset); > sigaddset(&sigset, SIGX); > rt_sigtimedwait(sigset, info); > > For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this > results in a kernel crash because sigwait() dequeues the signal and the > dequeue code observes: > > info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0 > > which triggers the following callchain: > > do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer() > > arm_timer() executes a list_add() on the timer, which is already armed via > the timer_set() syscall. That's a double list add which corrupts the posix > cpu timer list. As a consequence the kernel crashes on the next operation > touching the posix cpu timer list. > > Posix clocks which are internally implemented based on hrtimers are not > affected by this because hrtimer_start() can handle already armed timers > nicely, but it's a reliable way to trigger the WARN_ON() in > hrtimer_forward(), which complains about calling that function on an > already armed timer. This problem has existed since the posix timer code was merged into 2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to inject not just a signal (which linux has supported since 1.0) but the full siginfo of a signal. The core problem is that the code will reschedule in response to signals getting dequeued not just for signals the timers sent but for other signals that happen to a si_code of SI_TIMER. Avoid this confusion by testing to see if the queued signal was preallocated as all timer signals are preallocated, and so far only the timer code preallocates signals. Move the check for if a timer needs to be rescheduled up into collect_signal where the preallocation check must be performed, and pass the result back to dequeue_signal where the code reschedules timers. This makes it clear why the code cares about preallocated timers. Reported-by: Thomas Gleixner <tglx@linutronix.de> History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Reference: 66dd34ad31e5 ("signal: allow to send any siginfo to itself") Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO") Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24alarmtimer: Rate limit periodic intervalsThomas Gleixner
commit ff86bf0c65f14346bf2440534f9ba5ac232c39a0 upstream. The alarmtimer code has another source of potentially rearming itself too fast. Interval timers with a very samll interval have a similar CPU hog effect as the previously fixed overflow issue. The reason is that alarmtimers do not implement the normal protection against this kind of problem which the other posix timer use: timer expires -> queue signal -> deliver signal -> rearm timer This scheme brings the rearming under scheduler control and prevents permanently firing timers which hog the CPU. Bringing this scheme to the alarm timer code is a major overhaul because it lacks all the necessary mechanisms completely. So for a quick fix limit the interval to one jiffie. This is not problematic in practice as alarmtimers are usually backed by an RTC for suspend which have 1 second resolution. It could be therefor argued that the resolution of this clock should be set to 1 second in general, but that's outside the scope of this fix. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kostya Serebryany <kcc@google.com> Cc: syzkaller <syzkaller@googlegroups.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24alarmtimer: Prevent overflow of relative timersThomas Gleixner
commit f4781e76f90df7aec400635d73ea4c35ee1d4765 upstream. Andrey reported a alartimer related RCU stall while fuzzing the kernel with syzkaller. The reason for this is an overflow in ktime_add() which brings the resulting time into negative space and causes immediate expiry of the timer. The following rearm with a small interval does not bring the timer back into positive space due to the same issue. This results in a permanent firing alarmtimer which hogs the CPU. Use ktime_add_safe() instead which detects the overflow and clamps the result to KTIME_SEC_MAX. Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Kostya Serebryany <kcc@google.com> Cc: syzkaller <syzkaller@googlegroups.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24genirq: Release resources in __setup_irq() error pathHeiner Kallweit
commit fa07ab72cbb0d843429e61bf179308aed6cbe0dd upstream. In case __irq_set_trigger() fails the resources requested via irq_request_resources() are not released. Add the missing release call into the error handling path. Fixes: c1bacbae8192 ("genirq: Provide irq_request/release_resources chip callbacks") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/655538f5-cb20-a892-ff15-fbd2dd1fa4ec@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()Andy Lutomirski
commit 252d2a4117bc181b287eeddf848863788da733ae upstream. idle_task_exit() can be called with IRQs on x86 on and therefore should use switch_mm(), not switch_mm_irqs_off(). This doesn't seem to cause any problems right now, but it will confuse my upcoming TLB flush changes. Nonetheless, I think it should be backported because it's trivial. There won't be any meaningful performance impact because idle_task_exit() is only used when offlining a CPU. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/watchdog: prevent false hardlockup on overloaded systemDon Zickus
[ Upstream commit b94f51183b0617e7b9b4fb4137d4cf1cab7547c2 ] On an overloaded system, it is possible that a change in the watchdog threshold can be delayed long enough to trigger a false positive. This can easily be achieved by having a cpu spinning indefinitely on a task, while another cpu updates watchdog threshold. What happens is while trying to park the watchdog threads, the hrtimers on the other cpus trigger and reprogram themselves with the new slower watchdog threshold. Meanwhile, the nmi watchdog is still programmed with the old faster threshold. Because the one cpu is blocked, it prevents the thread parking on the other cpus from completing, which is needed to shutdown the nmi watchdog and reprogram it correctly. As a result, a false positive from the nmi watchdog is reported. Fix this by setting a park_in_progress flag to block all lockups until the parking is complete. Fix provided by Ulrich Obergfell. [akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/] Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com Signed-off-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/watchdog.c: move shared definitions to nmi.hBabu Moger
[ Upstream commit 249e52e35580fcfe5dad53a7dcd7c1252788749c ] Patch series "Clean up watchdog handlers", v2. This is an attempt to cleanup watchdog handlers. Right now, kernel/watchdog.c implements both softlockup and hardlockup detectors. Softlockup code is generic. Hardlockup code is arch specific. Some architectures don't use hardlockup detectors. They use their own watchdog detectors. To make both these combination work, we have numerous #ifdefs in kernel/watchdog.c. We are trying here to make these handlers independent of each other. Also provide an interface for architectures to implement their own handlers. watchdog_nmi_enable and watchdog_nmi_disable will be defined as weak such that architectures can override its definitions. Thanks to Don Zickus for his suggestions. Here are our previous discussions http://www.spinics.net/lists/sparclinux/msg16543.html http://www.spinics.net/lists/sparclinux/msg16441.html This patch (of 3): Move shared macros and definitions to nmi.h so that watchdog.c, new file watchdog_hld.c or any other architecture specific handler can use those definitions. Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/watchdog.c: move hardlockup detector to separate fileBabu Moger
[ Upstream commit 73ce0511c43686095efd2f65ef564aab952e07bc ] Separate hardlockup code from watchdog.c and move it to watchdog_hld.c. It is mostly straight forward. Remove everything inside CONFIG_HARDLOCKUP_DETECTORS. This code will go to file watchdog_hld.c. Also update the makefile accordigly. Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Andi Kleen <andi@firstfloor.org> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Josh Hunt <johunt@akamai.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kernel/ucount.c: mark user_header with kmemleak_ignore()Luis R. Rodriguez
[ Upstream commit ed5bd7dc88edf4a4a9c67130742b1b59aa017a5f ] The user_header gets caught by kmemleak with the following splat as missing a free: unreferenced object 0xffff99667a733d80 (size 96): comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s) hex dump (first 32 bytes): a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00 ................ 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: kmemleak_alloc+0x4a/0xa0 __kmalloc+0x144/0x260 __register_sysctl_table+0x54/0x5e0 register_sysctl+0x1b/0x20 user_namespace_sysctl_init+0x17/0x34 do_one_initcall+0x52/0x1a0 kernel_init_freeable+0x173/0x200 kernel_init+0xe/0x100 ret_from_fork+0x2c/0x40 The BUG_ON()s are intended to crash so no need to clean up after ourselves on error there. This is also a kernel/ subsys_init() we don't need a respective exit call here as this is never modular, so just white list it. Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()Rafael J. Wysocki
commit 4296f23ed49a15d36949458adcc66ff993dee2a8 upstream. sugov_start() only initializes struct sugov_cpu per-CPU structures for shared policies, but it should do that for single-CPU policies too. That in particular makes the IO-wait boost mechanism work in the cases when cpufreq policies correspond to individual CPUs. Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cpufreq: schedutil: move cached_raw_freq to struct sugov_policyViresh Kumar
commit 6c4f0fa643cb9e775dcc976e3db00d649468ff1d upstream. cached_raw_freq applies to the entire cpufreq policy and not individual CPUs. Apart from wasting per-cpu memory, it is actually wrong to keep it in struct sugov_cpu as we may end up comparing next_freq with a stale cached_raw_freq of a random CPU. Move cached_raw_freq to struct sugov_policy. Fixes: 5cbea46984d6 (cpufreq: schedutil: map raw required frequency to driver frequency) Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cpu/hotplug: Drop the device lock on errorSebastian Andrzej Siewior
commit 40da1b11f01e43aad1aa6cea64681b6125e8a2a7 upstream. If a custom CPU target is specified and that one is not available _or_ can't be interrupted then the code returns to userland without dropping a lock as notices by lockdep: |echo 133 > /sys/devices/system/cpu/cpu7/hotplug/target | ================================================ | [ BUG: lock held when returning to user space! ] | ------------------------------------------------ | bash/503 is leaving the kernel with locks still held! | 1 lock held by bash/503: | #0: (device_hotplug_lock){+.+...}, at: [<ffffffff815b5650>] lock_device_hotplug_sysfs+0x10/0x40 So release the lock then. Fixes: 757c989b9994 ("cpu/hotplug: Make target state writeable") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170602142714.3ogo25f2wbq6fjpj@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14perf/core: Drop kernel samples even though :u is specifiedJin Yao
commit cc1582c231ea041fbc68861dfaf957eaf902b829 upstream. When doing sampling, for example: perf record -e cycles:u ... On workloads that do a lot of kernel entry/exits we see kernel samples, even though :u is specified. This is due to skid existing. This might be a security issue because it can leak kernel addresses even though kernel sampling support is disabled. The patch drops the kernel samples if exclude_kernel is specified. For example, test on Haswell desktop: perf record -e cycles:u <mgen> perf report --stdio Before patch applied: 99.77% mgen mgen [.] buf_read 0.20% mgen mgen [.] rand_buf_init 0.01% mgen [kernel.vmlinux] [k] apic_timer_interrupt 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen mgen [.] rand_array_init 0.00% mgen [kernel.vmlinux] [k] page_fault 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] __strcasestr 0.00% mgen ld-2.23.so [.] strcmp 0.00% mgen ld-2.23.so [.] _dl_start 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start We can see kernel symbols apic_timer_interrupt and page_fault. After patch applied: 99.79% mgen mgen [.] buf_read 0.19% mgen mgen [.] rand_buf_init 0.00% mgen libc-2.23.so [.] __random_r 0.00% mgen mgen [.] rand_array_init 0.00% mgen mgen [.] last_free_elem 0.00% mgen libc-2.23.so [.] vfprintf 0.00% mgen libc-2.23.so [.] rand 0.00% mgen libc-2.23.so [.] __random 0.00% mgen libc-2.23.so [.] _int_malloc 0.00% mgen libc-2.23.so [.] _IO_doallocbuf 0.00% mgen ld-2.23.so [.] do_lookup_x 0.00% mgen ld-2.23.so [.] open_verify.constprop.7 0.00% mgen ld-2.23.so [.] _dl_important_hwcaps 0.00% mgen libc-2.23.so [.] sched_setaffinity@@GLIBC_2.3.4 0.00% mgen ld-2.23.so [.] _start There are only userspace symbols. Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Cc: kan.liang@intel.com Cc: mark.rutland@arm.com Cc: will.deacon@arm.com Cc: yao.jin@intel.com Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cpuset: consider dying css as offlineTejun Heo
commit 41c25707d21716826e3c1f60967f5550610ec1c9 upstream. In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14cgroup: Prevent kill_css() from being called more than onceWaiman Long
commit 33c35aa4817864e056fd772230b0c6b552e36ea2 upstream. The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harmm from being done when that happens. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14ptrace: Properly initialize ptracer_cred on forkEric W. Biederman
commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream. When I introduced ptracer_cred I failed to consider the weirdness of fork where the task_struct copies the old value by default. This winds up leaving ptracer_cred set even when a process forks and the child process does not wind up being ptraced. Because ptracer_cred is not set on non-ptraced processes whose parents were ptraced this has broken the ability of the enlightenment window manager to start setuid children. Fix this by properly initializing ptracer_cred in ptrace_init_task This must be done with a little bit of care to preserve the current value of ptracer_cred when ptrace carries through fork. Re-reading the ptracer_cred from the ptracing process at this point is inconsistent with how PT_PTRACE_CAP has been maintained all of these years. Tested-by: Takashi Iwai <tiwai@suse.de> Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25tracing/kprobes: Enforce kprobes teardown after testingThomas Gleixner
commit 30e7d894c1478c88d50ce94ddcdbd7f9763d9cdd upstream. Enabling the tracer selftest triggers occasionally the warning in text_poke(), which warns when the to be modified page is not marked reserved. The reason is that the tracer selftest installs kprobes on functions marked __init for testing. These probes are removed after the tests, but that removal schedules the delayed kprobes_optimizer work, which will do the actual text poke. If the work is executed after the init text is freed, then the warning triggers. The bug can be reproduced reliably when the work delay is increased. Flush the optimizer work and wait for the optimizing/unoptimizing lists to become empty before returning from the kprobes tracer selftest. That ensures that all operations which were queued due to the probes removal have completed. Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 6274de498 ("kprobes: Support delayed unoptimizing") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25genirq: Fix chained interrupt data orderingThomas Gleixner
commit 2c4569ca26986d18243f282dd727da27e9adae4c upstream. irq_set_chained_handler_and_data() sets up the chained interrupt and then stores the handler data. That's racy against an immediate interrupt which gets handled before the store of the handler data happened. The handler will dereference a NULL pointer and crash. Cure it by storing handler data before installing the chained handler. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25stackprotector: Increase the per-task stack canary's random range from 32 ↵Daniel Micay
bits to 64 bits on 64-bit platforms commit 5ea30e4e58040cfd6434c2f33dc3ea76e2c15b05 upstream. The stack canary is an 'unsigned long' and should be fully initialized to random data rather than only 32 bits of random data. Signed-off-by: Daniel Micay <danielmicay@gmail.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Arjan van Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-hardening@lists.openwall.com Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()Kirill Tkhai
commit 3fd37226216620c1a468afa999739d5016fbc349 upstream. Imagine we have a pid namespace and a task from its parent's pid_ns, which made setns() to the pid namespace. The task is doing fork(), while the pid namespace's child reaper is dying. We have the race between them: Task from parent pid_ns Child reaper copy_process() .. alloc_pid() .. .. zap_pid_ns_processes() .. disable_pid_allocation() .. read_lock(&tasklist_lock) .. iterate over pids in pid_ns .. kill tasks linked to pids .. read_unlock(&tasklist_lock) write_lock_irq(&tasklist_lock); .. attach_pid(p, PIDTYPE_PID); .. .. .. So, just created task p won't receive SIGKILL signal, and the pid namespace will be in contradictory state. Only manual kill will help there, but does the userspace care about this? I suppose, the most users just inject a task into a pid namespace and wait a SIGCHLD from it. The patch fixes the problem. It simply checks for (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process(). We do it under the tasklist_lock, and can't skip PIDNS_HASH_ADDING as noted by Oleg: "zap_pid_ns_processes() does disable_pid_allocation() and then takes tasklist_lock to kill the whole namespace. Given that copy_process() checks PIDNS_HASH_ADDING under write_lock(tasklist) they can't race; if copy_process() takes this lock first, the new child will be killed, otherwise copy_process() can't miss the change in ->nr_hashed." If allocation is disabled, we just return -ENOMEM like it's made for such cases in alloc_pid(). v2: Do not move disable_pid_allocation(), do not introduce a new variable in copy_process() and simplify the patch as suggested by Oleg Nesterov. Account the problem with double irq enabling found by Eric W. Biederman. Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Ingo Molnar <mingo@kernel.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Oleg Nesterov <oleg@redhat.com> CC: Mike Rapoport <rppt@linux.vnet.ibm.com> CC: Michal Hocko <mhocko@suse.com> CC: Andy Lutomirski <luto@kernel.org> CC: "Eric W. Biederman" <ebiederm@xmission.com> CC: Andrei Vagin <avagin@openvz.org> CC: Cyrill Gorcunov <gorcunov@openvz.org> CC: Serge Hallyn <serge@hallyn.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processesEric W. Biederman
commit b9a985db98961ae1ba0be169f19df1c567e4ffe0 upstream. The code can potentially sleep for an indefinite amount of time in zap_pid_ns_processes triggering the hung task timeout, and increasing the system average. This is undesirable. Sleep with a task state of TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these undesirable side effects. Apparently under heavy load this has been allowing Chrome to trigger the hung time task timeout error and cause ChromeOS to reboot. Reported-by: Vovo Yang <vovoy@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20padata: free correct variableJason A. Donenfeld
commit 07a77929ba672d93642a56dc2255dd21e6e2290b upstream. The author meant to free the variable that was just allocated, instead of the one that failed to be allocated, but made a simple typo. This patch rectifies that. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14bpf: don't let ldimm64 leak map addresses on unprivilegedDaniel Borkmann
[ Upstream commit 0d0e57697f162da4aa218b5feafe614fb666db07 ] The patch fixes two things at once: 1) It checks the env->allow_ptr_leaks and only prints the map address to the log if we have the privileges to do so, otherwise it just dumps 0 as we would when kptr_restrict is enabled on %pK. Given the latter is off by default and not every distro sets it, I don't want to rely on this, hence the 0 by default for unprivileged. 2) Printing of ldimm64 in the verifier log is currently broken in that we don't print the full immediate, but only the 32 bit part of the first insn part for ldimm64. Thus, fix this up as well; it's okay to access, since we verified all ldimm64 earlier already (including just constants) through replace_map_fd_with_map_ptr(). Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14bpf: enhance verifier to understand stack pointer arithmeticYonghong Song
[ Upstream commit 332270fdc8b6fba07d059a9ad44df9e1a2ad4529 ] llvm 4.0 and above generates the code like below: .... 440: (b7) r1 = 15 441: (05) goto pc+73 515: (79) r6 = *(u64 *)(r10 -152) 516: (bf) r7 = r10 517: (07) r7 += -112 518: (bf) r2 = r7 519: (0f) r2 += r1 520: (71) r1 = *(u8 *)(r8 +0) 521: (73) *(u8 *)(r2 +45) = r1 .... and the verifier complains "R2 invalid mem access 'inv'" for insn #521. This is because verifier marks register r2 as unknown value after #519 where r2 is a stack pointer and r1 holds a constant value. Teach verifier to recognize "stack_ptr + imm" and "stack_ptr + reg with const val" as valid stack_ptr with new offset. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-08cpu/hotplug: Serialize callback invocations properSebastian Andrzej Siewior
commit dc434e056fe1dada20df7ba07f32739d3a701adf upstream. The setup/remove_state/instance() functions in the hotplug core code are serialized against concurrent CPU hotplug, but unfortunately not serialized against themself. As a consequence a concurrent invocation of these function results in corruption of the callback machinery because two instances try to invoke callbacks on remote cpus at the same time. This results in missing callback invocations and initiator threads waiting forever on the completion. The obvious solution to replace get_cpu_online() with cpu_hotplug_begin() is not possible because at least one callsite calls into these functions from a get_online_cpu() locked region. Extend the protection scope of the cpuhp_state_mutex from solely protecting the state arrays to cover the callback invocation machinery as well. Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface") Reported-and-tested-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: hpa@zytor.com Cc: mingo@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-03bpf: improve verifier packet range checksAlexei Starovoitov
[ Upstream commit b1977682a3858b5584ffea7cfb7bd863f68db18d ] llvm can optimize the 'if (ptr > data_end)' checks to be in the order slightly different than the original C code which will confuse verifier. Like: if (ptr + 16 > data_end) return TC_ACT_SHOT; // may be followed by if (ptr + 14 > data_end) return TC_ACT_SHOT; while llvm can see that 'ptr' is valid for all 16 bytes, the verifier could not. Fix verifier logic to account for such case and add a test. Reported-by: Huapeng Zhou <hzhou@fb.com> Fixes: 969bf05eb3ce ("bpf: direct packet access") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-27ring-buffer: Have ring_buffer_iter_empty() return true when emptySteven Rostedt (VMware)
commit 78f7a45dac2a2d2002f98a3a95f7979867868d73 upstream. I noticed that reading the snapshot file when it is empty no longer gives a status. It suppose to show the status of the snapshot buffer as well as how to allocate and use it. For example: ># cat snapshot # tracer: nop # # # * Snapshot is allocated * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') But instead it just showed an empty buffer: ># cat snapshot # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | What happened was that it was using the ring_buffer_iter_empty() function to see if it was empty, and if it was, it showed the status. But that function was returning false when it was empty. The reason was that the iter header page was on the reader page, and the reader page was empty, but so was the buffer itself. The check only tested to see if the iter was on the commit page, but the commit page was no longer pointing to the reader page, but as all pages were empty, the buffer is also. Fixes: 651e22f2701b ("ring-buffer: Always reset iterator to reader page") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-27tracing: Allocate the snapshot buffer before enabling probeSteven Rostedt (VMware)
commit df62db5be2e5f070ecd1a5ece5945b590ee112e0 upstream. Currently the snapshot trigger enables the probe and then allocates the snapshot. If the probe triggers before the allocation, it could cause the snapshot to fail and turn tracing off. It's best to allocate the snapshot buffer first, and then enable the trigger. If something goes wrong in the enabling of the trigger, the snapshot buffer is still allocated, but it can also be freed by the user by writting zero into the snapshot buffer file. Also add a check of the return status of alloc_snapshot(). Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21ftrace: Fix function pid filter on instancesNamhyung Kim
commit d879d0b8c183aabeb9a65eba91f3f9e3c7e7b905 upstream. When function tracer has a pid filter, it adds a probe to sched_switch to track if current task can be ignored. The probe checks the ftrace_ignore_pid from current tr to filter tasks. But it misses to delete the probe when removing an instance so that it can cause a crash due to the invalid tr pointer (use-after-free). This is easily reproducible with the following: # cd /sys/kernel/debug/tracing # mkdir instances/buggy # echo $$ > instances/buggy/set_ftrace_pid # rmdir instances/buggy ============================================================================ BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90 Read of size 8 by task kworker/0:1/17 CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G B 4.11.0-rc3 #198 Call Trace: dump_stack+0x68/0x9f kasan_object_err+0x21/0x70 kasan_report.part.1+0x22b/0x500 ? ftrace_filter_pid_sched_switch_probe+0x3d/0x90 kasan_report+0x25/0x30 __asan_load8+0x5e/0x70 ftrace_filter_pid_sched_switch_probe+0x3d/0x90 ? fpid_start+0x130/0x130 __schedule+0x571/0xce0 ... To fix it, use ftrace_clear_pids() to unregister the probe. As instance_rmdir() already updated ftrace codes, it can just free the filter safely. Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org Fixes: 0c8916c34203 ("tracing: Add rmdir to remove multibuffer instances") Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21ftrace: Fix removing of second function probeSteven Rostedt (VMware)
commit 82cc4fc2e70ec5baeff8f776f2773abc8b2cc0ae upstream. When two function probes are added to set_ftrace_filter, and then one of them is removed, the update to the function locations is not performed, and the record keeping of the function states are corrupted, and causes an ftrace_bug() to occur. This is easily reproducable by adding two probes, removing one, and then adding it back again. # cd /sys/kernel/debug/tracing # echo schedule:traceoff > set_ftrace_filter # echo do_IRQ:traceoff > set_ftrace_filter # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter # echo do_IRQ:traceoff > set_ftrace_filter Causes: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220 Modules linked in: [...] CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: dump_stack+0x68/0x9f __warn+0x111/0x130 ? trace_irq_work_interrupt+0xa0/0xa0 warn_slowpath_null+0x1d/0x20 ftrace_get_addr_curr+0x143/0x220 ? __fentry__+0x10/0x10 ftrace_replace_code+0xe3/0x4f0 ? ftrace_int3_handler+0x90/0x90 ? printk+0x99/0xb5 ? 0xffffffff81000000 ftrace_modify_all_code+0x97/0x110 arch_ftrace_update_code+0x10/0x20 ftrace_run_update_code+0x1c/0x60 ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0 register_ftrace_function_probe+0x4b6/0x590 ? ftrace_startup+0x310/0x310 ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30 ? update_stack_state+0x88/0x110 ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320 ? preempt_count_sub+0x18/0xd0 ? mutex_lock_nested+0x104/0x800 ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320 ? __unwind_start+0x1c0/0x1c0 ? _mutex_lock_nest_lock+0x800/0x800 ftrace_trace_probe_callback.isra.3+0xc0/0x130 ? func_set_flag+0xe0/0xe0 ? __lock_acquire+0x642/0x1790 ? __might_fault+0x1e/0x20 ? trace_get_user+0x398/0x470 ? strcmp+0x35/0x60 ftrace_trace_onoff_callback+0x48/0x70 ftrace_regex_write.isra.43.part.44+0x251/0x320 ? match_records+0x420/0x420 ftrace_filter_write+0x2b/0x30 __vfs_write+0xd7/0x330 ? do_loop_readv_writev+0x120/0x120 ? locks_remove_posix+0x90/0x2f0 ? do_lock_file_wait+0x160/0x160 ? __lock_is_held+0x93/0x100 ? rcu_read_lock_sched_held+0x5c/0xb0 ? preempt_count_sub+0x18/0xd0 ? __sb_start_write+0x10a/0x230 ? vfs_write+0x222/0x240 vfs_write+0xef/0x240 SyS_write+0xab/0x130 ? SyS_read+0x130/0x130 ? trace_hardirqs_on_caller+0x182/0x280 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x7fe61c157c30 RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30 RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001 RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700 R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400 R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c ? trace_hardirqs_off_caller+0xc0/0x110 ---[ end trace 99fa09b3d9869c2c ]--- Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150) Fixes: 59df055f1991 ("ftrace: trace different functions with a different tracer") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21cgroup, kthread: close race window where new kthreads can be migrated to ↵Tejun Heo
non-root cgroups commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream. Creation of a kthread goes through a couple interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator then can further configure the kthread and then let it start doing its job by waking it up. In this configuration-by-creator stage, the creator is the only one that can wake it up but the kthread is visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things. Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation. This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrected CPU affinity breaking concurrency management and sometimes stalling workqueue execution. This patch adds task->no_cgroup_migration which disallows the task to be migrated by userland. kthreadd starts with the flag set making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup. It'd be better to wait for the initialization instead of failing but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything. v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-debugged-by: Chris Mason <clm@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-12ring-buffer: Fix return value check in test_ringbuffer()Wei Yongjun
commit 62277de758b155dc04b78f195a1cb5208c37b2df upstream. In case of error, the function kthread_run() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Link: http://lkml.kernel.org/r/1466184839-14927-1-git-send-email-weiyj_lk@163.com Fixes: 6c43e554a ("ring-buffer: Add ring buffer startup selftest") Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-12ptrace: fix PTRACE_LISTEN race corrupting task->statebsegall@google.com
commit 5402e97af667e35e54177af8f6575518bf251d51 upstream. In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against __TASK_TRACED. If this races with the ptrace_unfreeze_traced at the end of a PTRACE_LISTEN, this can wake the task /after/ the check against __TASK_TRACED, but before the reset of state to TASK_TRACED. This causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup against TRACED while the task is still on the rq wake_list, corrupting it. Oleg said: "The kernel can crash or this can lead to other hard-to-debug problems. In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced() assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the contract. Obviusly it is very wrong to manipulate task->state if this task is already running, or WAKING, or it sleeps again" [akpm@linux-foundation.org: coding-style fixes] Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL") Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com Signed-off-by: Ben Segall <bsegall@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08padata: avoid race in reorderingJason A. Donenfeld
commit de5540d088fe97ad583cc7d396586437b32149a5 upstream. Under extremely heavy uses of padata, crashes occur, and with list debugging turned on, this happens instead: [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33 __list_add+0xae/0x130 [87487.301868] list_add corruption. prev->next should be next (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00). [87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3 [87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0 [87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140 [87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50 [87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130 [87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70 [87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420 [87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120 padata_reorder calls list_add_tail with the list to which its adding locked, which seems correct: spin_lock(&squeue->serial.lock); list_add_tail(&padata->list, &squeue->serial.list); spin_unlock(&squeue->serial.lock); This therefore leaves only place where such inconsistency could occur: if padata->list is added at the same time on two different threads. This pdata pointer comes from the function call to padata_get_next(pd), which has in it the following block: next_queue = per_cpu_ptr(pd->pqueue, cpu); padata = NULL; reorder = &next_queue->reorder; if (!list_empty(&reorder->list)) { padata = list_entry(reorder->list.next, struct padata_priv, list); spin_lock(&reorder->lock); list_del_init(&padata->list); atomic_dec(&pd->reorder_objects); spin_unlock(&reorder->lock); pd->processed++; goto out; } out: return padata; I strongly suspect that the problem here is that two threads can race on reorder list. Even though the deletion is locked, call to list_entry is not locked, which means it's feasible that two threads pick up the same padata object and subsequently call list_add_tail on them at the same time. The fix is thus be hoist that lock outside of that block. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31sched/rt: Add a missing rescheduling pointSebastian Andrzej Siewior
commit 619bd4a71874a8fd78eb6ccf9f272c5e98bcc7b7 upstream. Since the change in commit: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks") ... we don't reschedule a task under certain circumstances: Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on CPU0) and holds a PI lock. This task is removed from the CPU because it used up its time slice and another SCHED_OTHER task is running. Task-B on CPU1 runs at RT priority and asks for the lock owned by task-A. This results in a priority boost for task-A. Task-B goes to sleep until the lock has been made available. Task-A is already runnable (but not active), so it receives no wake up. The reality now is that task-A gets on the CPU once the scheduler decides to remove the current task despite the fact that a high priority task is enqueued and waiting. This may take a long time. The desired behaviour is that CPU0 immediately reschedules after the priority boost which made task-A the task with the lowest priority. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks") Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-26cgroup/pids: remove spurious suspicious RCU usage warningTejun Heo
commit 1d18c2747f937f1b5ec65ce6bf4ccb9ca1aea9e8 upstream. pids_can_fork() is special in that the css association is guaranteed to be stable throughout the function and thus doesn't need RCU protection around task_css access. When determining the css to charge the pid, task_css_check() is used to override the RCU sanity check. While adding a warning message on fork rejection from pids limit, 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") incorrectly added a task_css access which is neither RCU protected or explicitly annotated. This triggers the following suspicious RCU usage warning when RCU debugging is enabled. cgroup: fork rejected by pids controller in =============================== [ ERR: suspicious RCU usage. ] 4.10.0-work+ #1 Not tainted ------------------------------- ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by bash/1748: #0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0 stack backtrace: CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 Call Trace: dump_stack+0x68/0x93 lockdep_rcu_suspicious+0xd7/0x110 pids_can_fork+0x1c7/0x1d0 cgroup_can_fork+0x67/0xc0 copy_process.part.58+0x1709/0x1e90 _do_fork+0xe6/0x6e0 SyS_clone+0x19/0x20 do_syscall_64+0x5c/0x140 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f7853fab93a RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700 R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4 R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d /asdf There's no reason to dereference task_css again here when the associated css is already available. Fix it by replacing the task_cgroup() call with css->cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Fixes: 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit") Cc: Kenny Yu <kennyyu@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-26perf/core: Fix event inheritance on fork()Peter Zijlstra
commit e7cc4865f0f31698ef2f7aac01a50e78968985b7 upstream. While hunting for clues to a use-after-free, Oleg spotted that perf_event_init_context() can loose an error value with the result that fork() can succeed even though we did not fully inherit the perf event context. Spotted-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: oleg@redhat.com Fixes: 889ff0150661 ("perf/core: Split context's event group list into pinned and non-pinned lists") Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-26perf/core: Fix use-after-free in perf_release()Peter Zijlstra
commit e552a8389aa409e257b7dcba74f67f128f979ccc upstream. Dmitry reported syzcaller tripped a use-after-free in perf_release(). After much puzzlement Oleg spotted the below scenario: Task1 Task2 fork() perf_event_init_task() /* ... */ goto bad_fork_$foo; /* ... */ perf_event_free_task() mutex_lock(ctx->lock) perf_free_event(B) perf_event_release_kernel(A) mutex_lock(A->child_mutex) list_for_each_entry(child, ...) { /* child == B */ ctx = B->ctx; get_ctx(ctx); mutex_unlock(A->child_mutex); mutex_lock(A->child_mutex) list_del_init(B->child_list) mutex_unlock(A->child_mutex) /* ... */ mutex_unlock(ctx->lock); put_ctx() /* >0 */ free_task(); mutex_lock(ctx->lock); mutex_lock(A->child_mutex); /* ... */ mutex_unlock(A->child_mutex); mutex_unlock(ctx->lock) put_ctx() /* 0 */ ctx->task && !TOMBSTONE put_task_struct() /* UAF */ This patch closes the hole by making perf_event_free_task() destroy the task <-> ctx relation such that perf_event_release_kernel() will no longer observe the now dead task. Spotted-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Cc: oleg@redhat.com Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events") Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=yNiklas Cassel
commit 17fcbd590d0c3e35bd9646e2215f86586378bc42 upstream. We hang if SIGKILL has been sent, but the task is stuck in down_read() (after do_exit()), even though no task is doing down_write() on the rwsem in question: INFO: task libupnp:21868 blocked for more than 120 seconds. libupnp D 0 21868 1 0x08100008 ... Call Trace: __schedule() schedule() __down_read() do_exit() do_group_exit() __wake_up_parent() This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in the following commit: 04cafed7fc19 ("locking/rwsem: Fix down_write_killable()") ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y. Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Niklas Cassel <niklass@axis.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: d47996082f52 ("locking/rwsem: Introduce basis for down_write_killable()") Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22futex: Add missing error handling to FUTEX_REQUEUE_PIPeter Zijlstra
commit 9bbb25afeb182502ca4f2c4f3f88af0681b34cae upstream. Thomas spotted that fixup_pi_state_owner() can return errors and we fail to unlock the rt_mutex in that case. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22futex: Fix potential use-after-free in FUTEX_REQUEUE_PIPeter Zijlstra
commit c236c8e95a3d395b0494e7108f0d41cf36ec107c upstream. While working on the futex code, I stumbled over this potential use-after-free scenario. Dmitry triggered it later with syzkaller. pi_mutex is a pointer into pi_state, which we drop the reference on in unqueue_me_pi(). So any access to that pointer after that is bad. Since other sites already do rt_mutex_unlock() with hb->lock held, see for example futex_lock_pi(), simply move the unlock before unqueue_me_pi(). Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22bpf: fix mark_reg_unknown_value for spilled regs on map value markingDaniel Borkmann
[ Upstream commit 6760bf2ddde8ad64f8205a651223a93de3a35494 ] Martin reported a verifier issue that hit the BUG_ON() for his test case in the mark_reg_unknown_value() function: [ 202.861380] kernel BUG at kernel/bpf/verifier.c:467! [...] [ 203.291109] Call Trace: [ 203.296501] [<ffffffff811364d5>] mark_map_reg+0x45/0x50 [ 203.308225] [<ffffffff81136558>] mark_map_regs+0x78/0x90 [ 203.320140] [<ffffffff8113938d>] do_check+0x226d/0x2c90 [ 203.331865] [<ffffffff8113a6ab>] bpf_check+0x48b/0x780 [ 203.343403] [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440 [ 203.355705] [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230 [ 203.369158] [<ffffffff812d8188>] ? security_capable+0x48/0x60 [ 203.382035] [<ffffffff811351a4>] SyS_bpf+0x124/0x960 [ 203.393185] [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490 [ 203.406258] [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94 This issue got uncovered after the fix in a08dd0da5307 ("bpf: fix regression on verifier pruning wrt map lookups"). The reason why it wasn't noticed before was, because as mentioned in a08dd0da5307, mark_map_regs() was doing the id matching incorrectly based on the uncached regs[regno].id. So, in the first loop, we walked all regs and as soon as we found regno == i, then this reg's id was cleared when calling mark_reg_unknown_value() thus that every subsequent register was probed against id of 0 (which, in combination with the PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other register state can hold), and therefore wasn't type transitioned such as in the spilled register case for the second loop. Now since that got fixed, it turned out that 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used mark_reg_unknown_value() incorrectly for the spilled regs, and thus hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG. Although spilled regs have the same type as the non-spilled regs for the verifier state, that is, struct bpf_reg_state, they are semantically different from the non-spilled regs. In other words, there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs in the stack, for example, register R<x> could have been spilled by the program to stack location X, Y, Z, and in mark_map_regs() we need to scan these stack slots of type STACK_SPILL for potential registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL. Therefore, depending on the location, the spilled_regs regno can be a lot higher than just MAX_BPF_REG's value since we operate on stack instead. The reset in mark_reg_unknown_value() itself is just fine, only that the BUG_ON() was inappropriate for this. Fix it by making a __mark_reg_unknown_value() version that can be called from mark_map_reg() generically; we know for the non-spilled case that the regno is always < MAX_BPF_REG anyway. Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") Reported-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22bpf: fix regression on verifier pruning wrt map lookupsDaniel Borkmann
[ Upstream commit a08dd0da5307ba01295c8383923e51e7997c3576 ] Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") introduced a regression where existing programs stopped loading due to reaching the verifier's maximum complexity limit, whereas prior to this commit they were loading just fine; the affected program has roughly 2k instructions. What was found is that state pruning couldn't be performed effectively anymore due to mismatches of the verifier's register state, in particular in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per se, but rather that verifier needs to perform a lot more work for the same program with regards to involved map lookups. Since commit 57a09bf0a416 is only about tracking registers with type PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers until they are promoted through pattern matching with a NULL check to either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the id becomes irrelevant for the transitioned types. For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(), but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even transferred further into other types that don't make use of it. Among others, one example is where UNKNOWN_VALUE is set on function call return with RET_INTEGER return type. states_equal() will then fall through the memcmp() on register state; note that the second memcmp() uses offsetofend(), so the id is part of that since d2a4dd37f6b4 ("bpf: fix state equivalence"). But the bisect pointed already to 57a09bf0a416, where we really reach beyond complexity limit. What I found was that states_equal() often failed in this case due to id mismatches in spilled regs with registers in type PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform a memcmp() on their reg state and don't have any other optimizations in place, therefore also id was relevant in this case for making a pruning decision. We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE. For the affected program, it resulted in a ~17 fold reduction of complexity and let the program load fine again. Selftest suite also runs fine. The only other place where env->id_gen is used currently is through direct packet access, but for these cases id is long living, thus a different scenario. Also, the current logic in mark_map_regs() is not fully correct when marking NULL branch with UNKNOWN_VALUE. We need to cache the destination reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE, it's id is reset and any subsequent registers that hold the original id and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE anymore, since mark_map_reg() reuses the uncached regs[regno].id that was just overridden. Note, we don't need to cache it outside of mark_map_regs(), since it's called once on this_branch and the other time on other_branch, which are both two independent verifier states. A test case for this is added here, too. Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22bpf: fix state equivalenceAlexei Starovoitov
[ Upstream commit d2a4dd37f6b41fbcad76efbf63124eb3126c66fe ] Commmits 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") and 484611357c19 ("bpf: allow access into map value arrays") by themselves are correct, but in combination they make state equivalence ignore 'id' field of the register state which can lead to accepting invalid program. Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registersThomas Graf
[ Upstream commit 57a09bf0a416700676e77102c28f9cfcb48267e0 ] A BPF program is required to check the return register of a map_elem_lookup() call before accessing memory. The verifier keeps track of this by converting the type of the result register from PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional jump ensures safety. This check is currently exclusively performed for the result register 0. In the event the compiler reorders instructions, BPF_MOV64_REG instructions may be moved before the conditional jump which causes them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the verifier objects when the register is accessed: 0: (b7) r1 = 10 1: (7b) *(u64 *)(r10 -8) = r1 2: (bf) r2 = r10 3: (07) r2 += -8 4: (18) r1 = 0x59c00000 6: (85) call 1 7: (bf) r4 = r0 8: (15) if r0 == 0x0 goto pc+1 R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp 9: (7a) *(u64 *)(r4 +0) = 0 R4 invalid mem access 'map_value_or_null' This commit extends the verifier to keep track of all identical PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by assigning them an ID and then marking them all when the conditional jump is observed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Josef Bacik <jbacik@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>