summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-12-12irq: Enable all irqs unconditionally in irq_resumeLaxman Dewangan
commit ac01810c9d2814238f08a227062e66a35a0e1ea2 upstream. When the system enters suspend, it disables all interrupts in suspend_device_irqs(), including the interrupts marked EARLY_RESUME. On the resume side things are different. The EARLY_RESUME interrupts are reenabled in sys_core_ops->resume and the non EARLY_RESUME interrupts are reenabled in the normal system resume path. When suspend_noirq() failed or suspend is aborted for any other reason, we might omit the resume side call to sys_core_ops->resume() and therefor the interrupts marked EARLY_RESUME are not reenabled and stay disabled forever. To solve this, enable all irqs unconditionally in irq_resume() regardless whether interrupts marked EARLY_RESUMEhave been already enabled or not. This might try to reenable already enabled interrupts in the non failure case, but the only affected platform is XEN and it has been confirmed that it does not cause any side effects. [ tglx: Massaged changelog. ] Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com> Acked-by-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Heiko Stuebner <heiko@sntech.de> Reviewed-by: Pavel Machek <pavel@ucw.cz> Cc: <ian.campbell@citrix.com> Cc: <rjw@rjwysocki.net> Cc: <len.brown@intel.com> Cc: <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLDMartin Schwidefsky
commit 4be77398ac9d948773116b6be4a3c91b3d6ea18c upstream. Since commit 1e75fa8be9f (time: Condense timekeeper.xtime into xtime_sec - merged in v3.6), there has been an problem with the error accounting in the timekeeping code, such that when truncating to nanoseconds, we round up to the next nsec, but the balancing adjustment to the ntp_error value was dropped. This causes 1ns per tick drift forward of the clock. In 3.7, this logic was isolated to only GENERIC_TIME_VSYSCALL_OLD architectures (s390, ia64, powerpc). The fix is simply to balance the accounting and to subtract the added nanosecond from ntp_error. This allows the internal long-term clock steering to keep the clock accurate. While this fix removes the regression added in 1e75fa8be9f, the ideal solution is to move away from GENERIC_TIME_VSYSCALL_OLD and use the new VSYSCALL method, which avoids entirely the nanosecond granular rounding, and the resulting short-term clock adjustment oscillation needed to keep long term accurate time. [ jstultz: Many thanks to Martin for his efforts identifying this subtle bug, and providing the fix. ] Originally-from: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Paul Turner <pjt@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1385149491-20307-1-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08ntp: Make periodic RTC update more reliableMiroslav Lichvar
commit a97ad0c4b447a132a322cedc3a5f7fa4cab4b304 upstream. The current code requires that the scheduled update of the RTC happens in the closest tick to the half of the second. This seems to be difficult to achieve reliably. The scheduled work may be missing the target time by a tick or two and be constantly rescheduled every second. Relax the limit to 10 ticks. As a typical RTC drifts in the 11-minute update interval by several milliseconds, this shouldn't affect the overall accuracy of the RTC much. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04cpuset: Fix memory allocator deadlockPeter Zijlstra
commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream. Juri hit the below lockdep report: [ 4.303391] ====================================================== [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted [ 4.303395] ------------------------------------------------------ [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290 [ 4.303417] [ 4.303417] and this task is already holding: [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100 [ 4.303431] which would create a new lock dependency: [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...} [ 4.303436] [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 { [ 4.303922] HARDIRQ-ON-W at: [ 4.303923] [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0 [ 4.303926] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303929] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303931] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303933] SOFTIRQ-ON-W at: [ 4.303933] [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0 [ 4.303935] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303940] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303955] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303959] INITIAL USE at: [ 4.303960] [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0 [ 4.303963] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303966] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303969] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303972] } Which reports that we take mems_allowed_seq with interrupts enabled. A little digging found that this can only be from cpuset_change_task_nodemask(). This is an actual deadlock because an interrupt doing an allocation will hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin forever waiting for the write side to complete. Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Reported-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04cgroup: fix cgroup_subsys_state leak for seq_filesTejun Heo
commit e605b36575e896edd8161534550c9ea021b03bc0 upstream. If a cgroup file implements either read_map() or read_seq_string(), such file is served using seq_file by overriding file->f_op to cgroup_seqfile_operations, which also overrides the release method to single_release() from cgroup_file_release(). Because cgroup_file_open() didn't use to acquire any resources, this used to be fine, but since f7d58818ba42 ("cgroup: pin cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open() pins the css (cgroup_subsys_state) which is put by cgroup_file_release(). The patch forgot to update the release path for seq_files and each open/release cycle leaks a css reference. Fix it by updating cgroup_file_release() to also handle seq_files and using it for seq_file release path too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04cgroup: use a dedicated workqueue for cgroup destructionTejun Heo
commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream. Since be44562613851 ("cgroup: remove synchronize_rcu() from cgroup_diput()"), cgroup destruction path makes use of workqueue. css freeing is performed from a work item from that point on and a later commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two steps"), moves css offlining to workqueue too. As cgroup destruction isn't depended upon for memory reclaim, the destruction work items were put on the system_wq; unfortunately, some controller may block in the destruction path for considerable duration while holding cgroup_mutex. As large part of destruction path is synchronized through cgroup_mutex, when combined with high rate of cgroup removals, this has potential to fill up system_wq's max_active of 256. Also, it turns out that memcg's css destruction path ends up queueing and waiting for work items on system_wq through work_on_cpu(). If such operation happens while system_wq is fully occupied by cgroup destruction work items, work_on_cpu() can't make forward progress because system_wq is full and other destruction work items on system_wq can't make forward progress because the work item waiting for work_on_cpu() is holding cgroup_mutex, leading to deadlock. This can be fixed by queueing destruction work items on a separate workqueue. This patch creates a dedicated workqueue - cgroup_destroy_wq - for this purpose. As these work items shouldn't have inter-dependencies and mostly serialized by cgroup_mutex anyway, giving high concurrency level doesn't buy anything and the workqueue's @max_active is set to 1 so that destruction work items are executed one by one on each CPU. Hugh Dickins: Because cgroup_init() is run before init_workqueues(), cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a separate core_initcall(). In the future, we probably want to reorder so that workqueue init happens before cgroup_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com> Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04workqueue: fix ordered workqueues in NUMA setupsTejun Heo
commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream. An ordered workqueue implements execution ordering by using single pool_workqueue with max_active == 1. On a given pool_workqueue, work items are processed in FIFO order and limiting max_active to 1 enforces the queued work items to be processed one by one. Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues") accidentally broke this guarantee by applying NUMA affinity to ordered workqueues too. On NUMA setups, an ordered workqueue would end up with separate pool_workqueues for different nodes. Each pool_workqueue still limits max_active to 1 but multiple work items may be executed concurrently and out of order depending on which node they are queued to. Fix it by using dedicated ordered_wq_attrs[] when creating ordered workqueues. The new attrs match the unbound ones except that no_numa is always set thus forcing all NUMA nodes to share the default pool_workqueue. While at it, add sanity check in workqueue creation path which verifies that an ordered workqueues has only the default pool_workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Libin <huawei.libin@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04ftrace: Fix function graph with loading of modulesSteven Rostedt (Red Hat)
commit 8a56d7761d2d041ae5e8215d20b4167d8aa93f51 upstream. Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload" fixed module loading and unloading with respect to function tracing, but it missed the function graph tracer. If you perform the following # cd /sys/kernel/debug/tracing # echo function_graph > current_tracer # modprobe nfsd # echo nop > current_tracer You'll get the following oops message: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9() Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000 0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668 ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000 Call Trace: [<ffffffff814fe193>] dump_stack+0x4f/0x7c [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364 [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78 [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10 [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e [<ffffffff81122aed>] vfs_write+0xab/0xf6 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff81122dbd>] SyS_write+0x59/0x82 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b ---[ end trace 940358030751eafb ]--- The above mentioned commit didn't go far enough. Well, it covered the function tracer by adding checks in __register_ftrace_function(). The problem is that the function graph tracer circumvents that (for a slight efficiency gain when function graph trace is running with a function tracer. The gain was not worth this). The problem came with ftrace_startup() which should always be called after __register_ftrace_function(), if you want this bug to be completely fixed. Anyway, this solution moves __register_ftrace_function() inside of ftrace_startup() and removes the need to call them both. Reported-by: Dave Wysochanski <dwysocha@redhat.com> Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04audit: log the audit_names record typeJeff Layton
commit d3aea84a4ace5ff9ce7fb7714cee07bebef681c2 upstream. ...to make it clear what the intent behind each record's operation was. In many cases you can infer this, based on the context of the syscall and the result. In other cases it's not so obvious. For instance, in the case where you have a file being renamed over another, you'll have two different records with the same filename but different inode info. By logging this information we can clearly tell which one was created and which was deleted. This fixes what was broken in commit bfcec708. Commit 79f6530c should also be backported to stable v3.7+. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04audit: fix info leak in AUDIT_GET requestsMathias Krause
commit 64fbff9ae0a0a843365d922e0057fc785f23f0e3 upstream. We leak 4 bytes of kernel stack in response to an AUDIT_GET request as we miss to initialize the mask member of status_set. Fix that. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04audit: use nlmsg_len() to get message payload lengthMathias Krause
commit 4d8fe7376a12bf4524783dd95cbc00f1fece6232 upstream. Using the nlmsg_len member of the netlink header to test if the message is valid is wrong as it includes the size of the netlink header itself. Thereby allowing to send short netlink messages that pass those checks. Use nlmsg_len() instead to test for the right message length. The result of nlmsg_len() is guaranteed to be non-negative as the netlink message already passed the checks of nlmsg_ok(). Also switch to min_t() to please checkpatch.pl. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04audit: printk USER_AVC messages when audit isn't enabledTyler Hicks
commit 0868a5e150bc4c47e7a003367cd755811eb41e0b upstream. When the audit=1 kernel parameter is absent and auditd is not running, AUDIT_USER_AVC messages are being silently discarded. AUDIT_USER_AVC messages should be sent to userspace using printk(), as mentioned in the commit message of 4a4cd633 ("AUDIT: Optimise the audit-disabled case for discarding user messages"). When audit_enabled is 0, audit_receive_msg() discards all user messages except for AUDIT_USER_AVC messages. However, audit_log_common_recv_msg() refuses to allocate an audit_buffer if audit_enabled is 0. The fix is to special case AUDIT_USER_AVC messages in both functions. It looks like commit 50397bd1 ("[AUDIT] clean up audit_receive_msg()") introduced this bug. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: linux-audit@redhat.com Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()Rafael J. Wysocki
commit 6a0c7cd33075f6b7f1d80145bb19812beb3fc5c9 upstream. I have received a report about the BUG_ON() in free_basic_memory_bitmaps() triggering mysteriously during an aborted s2disk hibernation attempt. The only way I can explain that is that /dev/snapshot was first opened for writing (resume mode), then closed and then opened again for reading and closed again without freezing tasks. In that case the first invocation of snapshot_open() would set the free_bitmaps flag in snapshot_state, which is a static variable. That flag wouldn't be cleared later and the second invocation of snapshot_open() would just leave it like that, so the subsequent snapshot_release() would see data->frozen set and free_basic_memory_bitmaps() would be called unnecessarily. To prevent that from happening clear data->free_bitmaps in snapshot_open() when the file is being opened for reading (hibernate mode). In addition to that, replace the BUG_ON() in free_basic_memory_bitmaps() with a WARN_ON() as the kernel can continue just fine if the condition checked by that macro occurs. Fixes: aab172891542 (PM / hibernate: Fix user space driven resume regression) Reported-by: Oliver Lorenz <olli@olorenz.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04PM / hibernate: Avoid overflow in hibernate_preallocate_memory()Aaron Lu
commit fd432b9f8c7c88428a4635b9f5a9c6e174df6e36 upstream. When system has a lot of highmem (e.g. 16GiB using a 32 bits kernel), the code to calculate how much memory we need to preallocate in normal zone may cause overflow. As Leon has analysed: It looks that during computing 'alloc' variable there is overflow: alloc = (3943404 - 1970542) - 1978280 = -5418 (signed) And this function goes to err_out. Fix this by avoiding that overflow. References: https://bugzilla.kernel.org/show_bug.cgi?id=60817 Reported-and-tested-by: Leon Drugi <eyak@wp.pl> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't existKOSAKI Motohiro
commit 98d6f4dd84a134d942827584a3c5f67ffd8ec35f upstream. Fedora Ruby maintainer reported latest Ruby doesn't work on Fedora Rawhide on ARM. (http://bugs.ruby-lang.org/issues/9008) Because of, commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no RTC device is present) intruduced to return ENOTSUPP when clock_get{time,res} can't find a RTC device. However this is incorrect. First, ENOTSUPP isn't exported to userland (ENOTSUP or EOPNOTSUP are the closest userland equivlents). Second, Posix and Linux man pages agree that clock_gettime and clock_getres should return EINVAL if clk_id argument is invalid. While the arugment that the clockid is valid, but just not supported on this hardware could be made, this is just a technicality that doesn't help userspace applicaitons, and only complicates error handling. Thus, this patch changes the code to use EINVAL. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: Vit Ondruch <v.ondruch@tiscali.cz> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [jstultz: Tweaks to commit message to include full rational] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04genirq: Set the irq thread policy without checking CAP_SYS_NICEThomas Pfaff
commit bbfe65c219c638e19f1da5adab1005b2d68ca810 upstream. In commit ee23871389 ("genirq: Set irq thread to RT priority on creation") we moved the assigment of the thread's priority from the thread's function into __setup_irq(). That function may run in user context for instance if the user opens an UART node and then driver calls requests in the ->open() callback. That user may not have CAP_SYS_NICE and so the irq thread won't run with the SCHED_OTHER policy. This patch uses sched_setscheduler_nocheck() so we omit the CAP_SYS_NICE check which is otherwise required for the SCHED_OTHER policy. [bigeasy: Rewrite the changelog] Signed-off-by: Thomas Pfaff <tpfaff@pcs.com> Cc: Ivo Sieben <meltedpianoman@gmail.com> Link: http://lkml.kernel.org/r/1381489240-29626-1-git-send-email-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29exec/ptrace: fix get_dumpable() incorrect testsKees Cook
commit d049f74f2dbe71354d43d393ac3a188947811348 upstream. The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch. Wrong logic: if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ } or if (dumpable == 0) { /* be protective */ } or if (!dumpable) { /* be protective */ } Correct logic: if (dumpable != SUID_DUMP_USER) { /* be protective */ } or if (dumpable != 1) { /* be protective */ } Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.) The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too. CVE-2013-2929 Reported-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29perf/ftrace: Fix paranoid level for enabling function tracerSteven Rostedt
commit 12ae030d54ef250706da5642fc7697cc60ad0df7 upstream. The current default perf paranoid level is "1" which has "perf_paranoid_kernel()" return false, and giving any operations that use it, access to normal users. Unfortunately, this includes function tracing and normal users should not be allowed to enable function tracing by default. The proper level is defined at "-1" (full perf access), which "perf_paranoid_tracepoint_raw()" will only give access to. Use that check instead for enabling function tracing. Reported-by: Dave Jones <davej@redhat.com> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> CVE: CVE-2013-2930 Fixes: ced39002f5ea ("ftrace, perf: Add support to use function tracepoint in perf") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29sched, idle: Fix the idle polling state logicPeter Zijlstra
commit ea8117478918a4734586d35ff530721b682425be upstream. Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") regressed several workloads and caused excessive reschedule interrupts. The patch in question failed to notice that the x86 code had an inverted sense of the polling state versus the new generic code (x86: default polling, generic: default !polling). Fix the two prominent x86 mwait based idle drivers and introduce a few new generic polling helpers (fixing the wrong smp_mb__after_clear_bit usage). Also switch the idle routines to using tif_need_resched() which is an immediate TIF_NEED_RESCHED test as opposed to need_resched which will end up being slightly different. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: lenb@kernel.org Cc: tglx@linutronix.de Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-20tracing: Fix potential out-of-bounds in trace_get_user()Steven Rostedt
commit 057db8488b53d5e4faa0cedb2f39d4ae75dfbdbb upstream. Andrey reported the following report: ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3 ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3) Accessed by thread T13003: #0 ffffffff810dd2da (asan_report_error+0x32a/0x440) #1 ffffffff810dc6b0 (asan_check_region+0x30/0x40) #2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20) #3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260) #4 ffffffff812a1065 (__fput+0x155/0x360) #5 ffffffff812a12de (____fput+0x1e/0x30) #6 ffffffff8111708d (task_work_run+0x10d/0x140) #7 ffffffff810ea043 (do_exit+0x433/0x11f0) #8 ffffffff810eaee4 (do_group_exit+0x84/0x130) #9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30) #10 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Allocated by thread T5167: #0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0) #1 ffffffff8128337c (__kmalloc+0xbc/0x500) #2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90) #3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0) #4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40) #5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430) #6 ffffffff8129b668 (finish_open+0x68/0xa0) #7 ffffffff812b66ac (do_last+0xb8c/0x1710) #8 ffffffff812b7350 (path_openat+0x120/0xb50) #9 ffffffff812b8884 (do_filp_open+0x54/0xb0) #10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0) #11 ffffffff8129d4b7 (SyS_open+0x37/0x50) #12 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Shadow bytes around the buggy address: ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap redzone: fa Heap kmalloc redzone: fb Freed heap region: fd Shadow gap: fe The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;' Although the crash happened in ftrace_regex_open() the real bug occurred in trace_get_user() where there's an incrementation to parser->idx without a check against the size. The way it is triggered is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop that reads the last character stores it and then breaks out because there is no more characters. Then the last character is read to determine what to do next, and the index is incremented without checking size. Then the caller of trace_get_user() usually nulls out the last character with a zero, but since the index is equal to the size, it writes a nul character after the allocated space, which can corrupt memory. Luckily, only root user has write access to this file. Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-29perf: Fix perf ring buffer memory orderingPeter Zijlstra
The PPC64 people noticed a missing memory barrier and crufty old comments in the perf ring buffer code. So update all the comments and add the missing barrier. When the architecture implements local_t using atomic_long_t there will be double barriers issued; but short of introducing more conditional barrier primitives this is the best we can do. Reported-by: Victor Kaplansky <victork@il.ibm.com> Tested-by: Victor Kaplansky <victork@il.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: michael@ellerman.id.au Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: anton@samba.org Cc: benh@kernel.crashing.org Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-27Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "This tree contains a clockevents regression fix for certain ARM subarchitectures" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Sanitize ticks to nsec conversion
2013-10-27Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "The tree contains three fixes: - Two tooling fixes - Reversal of the new 'MMAP2' extended mmap record ABI, introduced in this merge window. (Patches were proposed to fix it but it was all a bit late and we felt it's safer to just delay the ABI one more kernel release and do it right)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Disable PERF_RECORD_MMAP2 support perf scripting perl: Fix build error on Fedora 12 perf probe: Fix to initialize fname always before use it
2013-10-27Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "This tree fixes a boot crash in CONFIG_DEBUG_MUTEXES=y kernels, on kernels built with GCC 3.x (there are still such distros)" Side note: it's not just a fix for old gcc versions, it's also removing an incredibly broken/subtle check that LLVM had issues with, and that made no sense. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mutex: Avoid gcc version dependent __builtin_constant_p() usage
2013-10-26Merge tag 'pm+acpi-3.12-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from "These fix two bugs in the intel_pstate driver, a hibernate bug leading to nasty resume failures sometimes and acpi-cpufreq initialization bug that causes problems to happen during module unload when intel_pstate is in use. Specifics: - Fix for rounding errors in intel_pstate causing CPU utilization to be underestimated from Brennan Shacklett. - intel_pstate fix to always use the correct max pstate value when computing the min pstate from Dirk Brandewie. - Hibernation fix for deadlocking resume in cases when the probing of the device containing the image is deferred from Russ Dill. - acpi-cpufreq fix to prevent the module from staying in memory when the driver cannot be registered and then attempting to unregister things that have never been registered on exit" * tag 'pm+acpi-3.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: acpi-cpufreq: Fail initialization if driver cannot be registered PM / hibernate: Move software_resume to late_initcall_sync intel_pstate: Correct calculation of min pstate value intel_pstate: Improve accuracy by not truncating until final result
2013-10-24PM / hibernate: Move software_resume to late_initcall_syncRuss Dill
software_resume is being called after deferred_probe_initcall in drivers base. If the probing of the device that contains the resume image is deferred, and the system has been instructed to wait for it to show up, this wait will occur in software_resume. This causes a deadlock. Move software_resume into late_initcall_sync so that it happens after all the other late_initcalls. Signed-off-by: Russ Dill <Russ.Dill@ti.com> Acked-by: Pavel Machek <Pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-10-23clockevents: Sanitize ticks to nsec conversionThomas Gleixner
Marc Kleine-Budde pointed out, that commit 77cc982 "clocksource: use clockevents_config_and_register() where possible" caused a regression for some of the converted subarchs. The reason is, that the clockevents core code converts the minimal hardware tick delta to a nanosecond value for core internal usage. This conversion is affected by integer math rounding loss, so the backwards conversion to hardware ticks will likely result in a value which is less than the configured hardware limitation. The affected subarchs used their own workaround (SIGH!) which got lost in the conversion. The solution for the issue at hand is simple: adding evt->mult - 1 to the shifted value before the integer divison in the core conversion function takes care of it. But this only works for the case where for the scaled math mult/shift pair "mult <= 1 << shift" is true. For the case where "mult > 1 << shift" we can apply the rounding add only for the minimum delta value to make sure that the backward conversion is not less than the given hardware limit. For the upper bound we need to omit the rounding add, because the backwards conversion is always larger than the original latch value. That would violate the upper bound of the hardware device. Though looking closer at the details of that function reveals another bogosity: The upper bounds check is broken as well. Checking for a resulting "clc" value greater than KTIME_MAX after the conversion is pointless. The conversion does: u64 clc = (latch << evt->shift) / evt->mult; So there is no sanity check for (latch << evt->shift) exceeding the 64bit boundary. The latch argument is "unsigned long", so on a 64bit arch the handed in argument could easily lead to an unnoticed shift overflow. With the above rounding fix applied the calculation before the divison is: u64 clc = (latch << evt->shift) + evt->mult - 1; So we need to make sure, that neither the shift nor the rounding add is overflowing the u64 boundary. [ukl: move assignment to rnd after eventually changing mult, fix build issue and correct comment with the right math] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: nicolas.ferre@atmel.com Cc: Marc Pignat <marc.pignat@hevs.ch> Cc: john.stultz@linaro.org Cc: kernel@pengutronix.de Cc: Ronald Wahl <ronald.wahl@raritan.com> Cc: LAK <linux-arm-kernel@lists.infradead.org> Cc: Ludovic Desroches <ludovic.desroches@atmel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1380052223-24139-1-git-send-email-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
2013-10-22Merge branch 'for-3.12-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two late fixes for cgroup. One fixes descendant walk introduced during this rc1 cycle. The other fixes a post 3.9 bug during task attach which can lead to hang. Both fixes are critical and the fixes are relatively straight-forward" * 'for-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix to break the while loop in cgroup_attach_task() correctly cgroup: fix cgroup post-order descendant walk of empty subtree
2013-10-20Merge tag 'perf-urgent-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: " * Fix build error on Fedora 12. * Fix to initialize fname always before use it, bug introduced during this merge window, from Masami Hiramatsu. * Disable PERF_RECORD_MMAP2 support, from Stephane Eranian. " Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-18mutex: Avoid gcc version dependent __builtin_constant_p() usageTetsuo Handa
Commit 040a0a37 ("mutex: Add support for wound/wait style locks") used "!__builtin_constant_p(p == NULL)" but gcc 3.x cannot handle such expression correctly, leading to boot failure when built with CONFIG_DEBUG_MUTEXES=y. Fix it by explicitly passing a bool which tells whether p != NULL or not. [ PeterZ: This is a sad patch, but provided it actually generates similar code I suppose its the best we can do bar whole sale deprecating gcc-3. ] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: peterz@infradead.org Cc: imirkin@alum.mit.edu Cc: daniel.vetter@ffwll.ch Cc: robdclark@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/201310171945.AGB17114.FSQVtHOJFOOFML@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-17perf: Disable PERF_RECORD_MMAP2 supportStephane Eranian
For now, we disable the extended MMAP record support (MMAP2). We have identified cases where it would not report the correct mapping information, clone(VM_CLONE) but with separate pids. We will revisit the support once we find a solution for this case. The patch changes the kernel to return EINVAL if attr->mmap2 is set. The patch also modifies the perf tool to use regular PERF_RECORD_MMAP for synthetic events and it also prevents the tool from requesting attr->mmap2 mode because the kernel would reject it. The support will be revisited once the kenrel interface is updated. In V2, we reduce the patch to the strict minimum. In V3, we avoid calling perf_event_open() with mmap2 set because we know it will fail and require fallback retry. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131017173215.GA8820@quad Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-10-13cgroup: fix to break the while loop in cgroup_attach_task() correctlyAnjana V Kumar
Both Anjana and Eunki reported a stall in the while_each_thread loop in cgroup_attach_task(). It's because, when we attach a single thread to a cgroup, if the cgroup is exiting or is already in that cgroup, we won't break the loop. If the task is already in the cgroup, the bug can lead to another thread being attached to the cgroup unexpectedly: # echo 5207 > tasks # cat tasks 5207 # echo 5207 > tasks # cat tasks 5207 5215 What's worse, if the task to be attached isn't the leader of the thread group, we might never exit the loop, hence cpu stall. Thanks for Oleg's analysis. This bug was introduced by commit 081aa458c38ba576bdd4265fc807fa95b48b9e79 ("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()") [ lizf: - fixed the first continue, pointed out by Oleg, - rewrote changelog. ] Cc: <stable@vger.kernel.org> # 3.9+ Reported-by: Eunki Kim <eunki_kim@samsung.com> Reported-by: Anjana V Kumar <anjanavk12@gmail.com> Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-10-08Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixlets: On the kernel side: - fix a race - fix a bug in the handling of the perf ring-buffer data page On the tooling side: - fix the handling of certain corrupted perf.data files - fix a bug in 'perf probe' - fix a bug in 'perf record + perf sched' - fix a bug in 'make install' - fix a bug in libaudit feature-detection on certain distros" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf session: Fix infinite loop on invalid perf.data file perf tools: Fix installation of libexec components perf probe: Fix to find line information for probe list perf tools: Fix libaudit test perf stat: Set child_pid after perf_evlist__prepare_workload() perf tools: Add default handler for mmap2 events perf/x86: Clean up cap_user_time* setting perf: Fix perf_pmu_migrate_context
2013-10-04Merge tag 'pm+acpi-3.12-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: - The resume part of user space driven hibernation (s2disk) is now broken after the change that moved the creation of memory bitmaps to after the freezing of tasks, because I forgot that the resume utility loaded the image before freezing tasks and needed the bitmaps for that. The fix adds special handling for that case. - One of recent commits changed the export of acpi_bus_get_device() to EXPORT_SYMBOL_GPL(), which was technically correct but broke existing binary modules using that function including one in particularly widespread use. Change it back to EXPORT_SYMBOL(). - The intel_pstate driver sometimes fails to disable turbo if its no_turbo sysfs attribute is set. Fix from Srinivas Pandruvada. - One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0 which still (incorrectly) treats non-NULL as non-error. Fix from Philipp Zabel. - The SPEAr cpufreq driver uses a wrong variable type in one place preventing it from catching errors returned by one of the functions called by it. Fix from Sachin Kamat. * tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: Use EXPORT_SYMBOL() for acpi_bus_get_device() intel_pstate: fix no_turbo cpufreq: cpufreq-cpu0: NULL is a valid regulator, part 2 cpufreq: SPEAr: Fix incorrect variable type PM / hibernate: Fix user space driven resume regression
2013-10-04perf: Fix perf_pmu_migrate_contextPeter Zijlstra
While auditing the list_entry usage due to a trinity bug I found that perf_pmu_migrate_context violates the rules for perf_event::event_entry. The problem is that perf_event::event_entry is a RCU list element, and hence we must wait for a full RCU grace period before re-using the element after deletion. Therefore the usage in perf_pmu_migrate_context() which re-uses the entry immediately is broken. For now introduce another list_head into perf_event for this specific usage. This doesn't actually fix the trinity report because that never goes through this code. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-02Merge branch 'irq/urgent-v2' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into irq/urgent Pull a hardirq-nesting fix from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-01irq: Force hardirq exit's softirq processing on its own stackFrederic Weisbecker
The commit facd8b80c67a3cf64a467c4a2ac5fb31f2e6745b ("irq: Sanitize invoke_softirq") converted irq exit calls of do_softirq() to __do_softirq() on all architectures, assuming it was only used there for its irq disablement properties. But as a side effect, the softirqs processed in the end of the hardirq are always called on the inline current stack that is used by irq_exit() instead of the softirq stack provided by the archs that override do_softirq(). The result is mostly safe if the architecture runs irq_exit() on a separate irq stack because then softirqs are processed on that same stack that is near empty at this stage (assuming hardirq aren't nesting). Otherwise irq_exit() runs in the task stack and so does the softirq too. The interrupted call stack can be randomly deep already and the softirq can dig through it even further. To add insult to the injury, this softirq can be interrupted by a new hardirq, maximizing the chances for a stack overrun as reported in powerpc for example: do_IRQ: stack overflow: 1920 CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1 Call Trace: [c0000000050a8740] .show_stack+0x130/0x200 (unreliable) [c0000000050a8810] .dump_stack+0x28/0x3c [c0000000050a8880] .do_IRQ+0x2b8/0x2c0 [c0000000050a8930] hardware_interrupt_common+0x154/0x180 --- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp] LR = .cp_start_xmit+0x390/0x820 [8139cp] [c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640 [c0000000050a8e00] .sch_direct_xmit+0x110/0x260 [c0000000050a8ea0] .dev_queue_xmit+0x260/0x630 [c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge] [c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge] [c0000000050a9070] .dev_hard_start_xmit+0x394/0x640 [c0000000050a9130] .dev_queue_xmit+0x428/0x630 [c0000000050a91d0] .ip_finish_output+0x2a4/0x550 [c0000000050a9290] .ip_local_out+0x50/0x70 [c0000000050a9310] .ip_queue_xmit+0x148/0x420 [c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0 [c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0 [c0000000050a9520] .tcp_rcv_established+0x1e8/0x930 [c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570 [c0000000050a96c0] .tcp_v4_rcv+0x734/0x930 [c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360 [c0000000050a9840] .ip_rcv_finish+0x148/0x400 [c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00 [c0000000050a99d0] .netif_receive_skb+0x44/0x110 [c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge] [c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge] [c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge] [c0000000050a9c70] .nf_iterate+0x114/0x130 [c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0 [c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge] [c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00 [c0000000050a9fa0] .netif_receive_skb+0x44/0x110 [c0000000050aa040] .napi_gro_receive+0xe8/0x120 [c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp] [c0000000050aa1d0] .net_rx_action+0x1dc/0x310 [c0000000050aa2b0] .__do_softirq+0x158/0x330 [c0000000050aa3b0] .irq_exit+0xc8/0x110 [c0000000050aa430] .do_IRQ+0xdc/0x2c0 [c0000000050aa4e0] hardware_interrupt_common+0x154/0x180 --- Exception: 501 at .bad_range+0x1c/0x110 LR = .get_page_from_freelist+0x908/0xbb0 [c0000000050aa7d0] .list_del+0x18/0x50 (unreliable) [c0000000050aa850] .get_page_from_freelist+0x908/0xbb0 [c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0 [c0000000050aaba0] .alloc_pages_vma+0xd0/0x210 [c0000000050aac60] .handle_pte_fault+0x814/0xb70 [c0000000050aad50] .__get_user_pages+0x1a4/0x640 [c0000000050aae60] .get_user_pages_fast+0xec/0x160 [c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm] [c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm] [c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm] [c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm] [c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm] [c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm] [c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm] [c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm] [c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm] [c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm] [c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0 [c0000000050abd80] .SyS_ioctl+0xd4/0xf0 [c0000000050abe30] syscall_exit+0x0/0x98 Since this is a regression, this patch proposes a minimalistic and low-risk solution by blindly forcing the hardirq exit processing of softirqs on the softirq stack. This way we should reduce significantly the opportunities for task stack overflow dug by softirqs. Longer term solutions may involve extending the hardirq stack coverage to irq_exit(), etc... Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: #3.9.. <stable@vger.kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Andrew Morton <akpm@linux-foundation.org>
2013-09-30pidns: fix free_pid() to handle the first fork failureOleg Nesterov
"case 0" in free_pid() assumes that disable_pid_allocation() should clear PIDNS_HASH_ADDING before the last pid goes away. However this doesn't happen if the first fork() fails to create the child reaper which should call disable_pid_allocation(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30kernel/kmod.c: check for NULL in call_usermodehelper_exec()Tetsuo Handa
If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer dereference happens upon core dump because argv_split("") returns argv[0] == NULL. This bug was once fixed by commit 264b83c07a84 ("usermodehelper: check subprocess_info->path != NULL") but was by error reintroduced by commit 7f57cfa4e2aa ("usermodehelper: kill the sub_info->path[0] check"). This bug seems to exist since 2.6.19 (the version which core dump to pipe was added). Depending on kernel version and config, some side effect might happen immediately after this oops (e.g. kernel panic with 2.6.32-358.18.1.el6). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30PM / hibernate: Fix user space driven resume regressionRafael J. Wysocki
Recent commit 8fd37a4 (PM / hibernate: Create memory bitmaps after freezing user space) broke the resume part of the user space driven hibernation (s2disk), because I forgot that the resume utility loaded the image into memory without freezing user space (it still freezes tasks after loading the image). This means that during user space driven resume we need to create the memory bitmaps at the "device open" time rather than at the "freeze tasks" time, so make that happen (that's a special case anyway, so it needs to be treated in a special way). Reported-and-tested-by: Ronald <ronald645@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-09-28Merge branches 'sched-urgent-for-linus', 'timers-urgent-for-linus' and ↵Linus Torvalds
'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler, timer and x86 fixes from Ingo Molnar: - A context tracking ARM build and functional fix - A handful of ARM clocksource/clockevent driver fixes - An AMD microcode patch level sysfs reporting fixlet * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm: Fix build error with context tracking calls * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast clocksource: of: Respect device tree node status clocksource: exynos_mct: Set IRQ affinity when the CPU goes online arm: clocksource: mvebu: Use the main timer as clock source from DT * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/AMD: Fix patch level reporting for family 15h
2013-09-28kernel/params: fix handling of signed integer typesJean Delvare
Commit 6072ddc8520b ("kernel: replace strict_strto*() with kstrto*()") broke the handling of signed integer types, fix it. Signed-off-by: Jean Delvare <khali@linux-fr.org> Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-28Merge branch 'context_tracking/fixes' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/urgent Pull context tracking ARM fix from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-27arm: Fix build error with context tracking callsFrederic Weisbecker
ad65782fba50 (context_tracking: Optimize main APIs off case with static key) converted context tracking main APIs to inline function and left ARM asm callers behind. This can be easily fixed by making ARM calling the post static keys context tracking function. We just need to replicate the static key checks there. We'll remove these later when ARM will support the context tracking static keys. Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Kevin Hilman <khilman@linaro.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Anil Kumar <anilk4.v@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Benoit Cousson <b-cousson@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Kevin Hilman <khilman@linaro.org>
2013-09-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three small fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/balancing: Fix cfs_rq->task_h_load calculation sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance() sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
2013-09-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Assorted standalone fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add model number for Avoton Silvermont perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page' perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group() perf: Update ABI comment tools lib lk: Uninclude linux/magic.h in debugfs.c perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms() perf probe: Fix finder to find lines of given function perf session: Check for SIGINT in more loops perf tools: Fix compile with libelf without get_phdrnum perf tools: Fix buildid cache handling of kallsyms with kcore perf annotate: Fix objdump line parsing offset validation perf tools: Fill in new definitions for madvise()/mmap() flags perf tools: Sharpen the libaudit dependencies test
2013-09-25kernel/reboot.c: re-enable the function of variable reboot_defaultChuansheng Liu
Commit 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel") did some cleanup for reboot= command line, but it made the reboot_default inoperative. The default value of variable reboot_default should be 1, and if command line reboot= is not set, system will use the default reboot mode. [akpm@linux-foundation.org: fix comment layout] Signed-off-by: Li Fei <fei.li@intel.com> Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Robin Holt <robinmholt@linux.com> Cc: <stable@vger.kernel.org> [3.11.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-25audit: fix endless wait in audit_log_start()Konstantin Khlebnikov
After commit 829199197a43 ("kernel/audit.c: avoid negative sleep durations") audit emitters will block forever if userspace daemon cannot handle backlog. After the timeout the waiting loop turns into busy loop and runs until daemon dies or returns back to work. This is a minimal patch for that bug. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Richard Guy Briggs <rgb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Chuck Anderson <chuck.anderson@oracle.com> Cc: Dan Duval <dan.duval@oracle.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-25watchdog: update watchdog_thresh properlyMichal Hocko
watchdog_tresh controls how often nmi perf event counter checks per-cpu hrtimer_interrupts counter and blows up if the counter hasn't changed since the last check. The counter is updated by per-cpu watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh period which guarantees that hrtimer is scheduled 2 times per the main period. Both hrtimer and perf event are started together when the watchdog is enabled. So far so good. But... But what happens when watchdog_thresh is updated from sysctl handler? proc_dowatchdog will set a new sampling period and hrtimer callback (watchdog_timer_fn) will use the new value in the next round. The problem, however, is that nobody tells the perf event that the sampling period has changed so it is ticking with the period configured when it has been set up. This might result in an ear ripping dissonance between perf and hrtimer parts if the watchdog_thresh is increased. And even worse it might lead to KABOOM if the watchdog is configured to panic on such a spurious lockup. This patch fixes the issue by updating both nmi perf even counter and hrtimers if the threshold value has changed. The nmi one is disabled and then reinitialized from scratch. This has an unpleasant side effect that the allocation of the new event might fail theoretically so the hard lockup detector would be disabled for such cpus. On the other hand such a memory allocation failure is very unlikely because the original event is deallocated right before. It would be much nicer if we just changed perf event period but there doesn't seem to be any API to do that right now. It is also unfortunate that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we cannot use on_each_cpu() and do the same thing from the per-cpu context. The update from the current CPU should be safe because perf_event_disable removes the event atomically before it clears the per-cpu watchdog_ev so it cannot change anything under running handler feet. The hrtimer is simply restarted (thanks to Don Zickus who has pointed this out) if it is queued because we cannot rely it will fire&adopt to the new sampling period before a new nmi event triggers (when the treshold is decreased). [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place] Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Fabio Estevam <festevam@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-25watchdog: update watchdog attributes atomicallyMichal Hocko
proc_dowatchdog doesn't synchronize multiple callers which might lead to confusion when two parallel callers might confuse watchdog_enable_all_cpus resp watchdog_disable_all_cpus (eg watchdog gets enabled even if watchdog_thresh was set to 0 already). This patch adds a local mutex which synchronizes callers to the sysctl handler. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>