Age Commit message Author
2012-08-22 vfio: grab vfio_device reference *before* exposing the sucker via fd_install() Al Viro
It's not critical (anymore) since another thread closing the file will block on ->device_lock before it gets to dropping the final reference, but it's definitely cleaner that way... Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-22 vfio: get rid of vfio_device_put()/vfio_group_get_device* races Al Viro
we really need to make sure that dropping the last reference happens under the group->device_lock; otherwise a loop (under device_lock) might find vfio_device instance that is being freed right now, has already dropped the last reference and waits on device_lock to exclude the sucker from the list. Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-22 vfio: get rid of open-coding kref_put_mutex Al Viro
Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-22 introduce kref_put_mutex() Al Viro
equivalent of
	mutex_lock(mutex);
	if (!kref_put(kref, release))
		mutex_unlock(mutex);
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
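The "equivalent of" form quoted in this commit message can be modelled in userspace. The following is a minimal sketch, assuming a pthread mutex in place of the kernel mutex; struct ref_model, count_release() and ref_put_mutex_model() are hypothetical names made up for illustration, and the sketch does not reproduce the kernel implementation. The point it shows is the locking contract: when the last reference is dropped the mutex is still held on return, so nobody walking a list under that mutex can find an object that is already being freed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct kref; not the kernel type. */
struct ref_model {
    int count;              /* only modified with the mutex held in this sketch */
};

int released_objects;       /* bookkeeping for the example release callback */

/* Example release callback: a real one would unlink and free the object. */
void count_release(struct ref_model *r)
{
    (void)r;
    released_objects++;
}

/*
 * Drop one reference with `lock` held across the decrement.
 * Returns true if this was the last reference: release() has run and the
 * mutex is STILL HELD, so the caller (or release itself) must unlock it.
 * Returns false otherwise, with the mutex already unlocked.
 */
bool ref_put_mutex_model(struct ref_model *r,
                         void (*release)(struct ref_model *),
                         pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);
    if (--r->count == 0) {
        release(r);
        return true;
    }
    pthread_mutex_unlock(lock);
    return false;
}
```

A caller that gets true back is expected to finish any list removal and then unlock; a caller that gets false back must not touch the mutex.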
2012-08-22 vfio: don't dereference after kfree... Al Viro
Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-22 KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended Takuya Yoshikawa
Although the possible race described in
	commit 85b7059169e128c57a3a8a3e588fb89cb2031da1
	KVM: MMU: fix shrinking page from the empty mmu
was correct, the real cause of that issue was a more trivial bug of mmu_shrink() introduced by
	commit 1952639665e92481c34c34c3e2a71bf3e66ba362
	KVM: MMU: do not iterate over all VMs in mmu_shrink()
Here is the bug:
	if (kvm->arch.n_used_mmu_pages > 0) {
		if (!nr_to_scan--)
			break;
		continue;
	}
We skip VMs whose n_used_mmu_pages is not zero and try to shrink others: in other words we try to shrink empty ones by mistake. This patch reverses the logic so that mmu_shrink() can free pages from the first VM whose n_used_mmu_pages is not zero. Note that we also add comments explaining the role of nr_to_scan which is not practically important now, hoping this will be improved in the future.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
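The inverted test described above is easy to see in a reduced model. This is only a sketch, assuming an array of VMs instead of the kernel's VM list; struct vm_model, pick_vm_buggy() and pick_vm_fixed() are hypothetical names, and the fixed variant is one plausible reading of "reverses the logic", not the patch itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-VM state the shrinker looks at. */
struct vm_model {
    unsigned int n_used_mmu_pages;
};

/* Buggy selection: every VM that HAS pages is skipped, so only an empty
 * VM can ever be chosen. Returns the chosen index or -1. */
int pick_vm_buggy(const struct vm_model *vms, size_t n, int nr_to_scan)
{
    for (size_t i = 0; i < n; i++) {
        if (vms[i].n_used_mmu_pages > 0) {
            if (!nr_to_scan--)
                break;
            continue;
        }
        return (int)i;
    }
    return -1;
}

/* Reversed logic: bound the scan, skip empty VMs, and choose the first
 * VM that actually has pages to free. Returns the index or -1. */
int pick_vm_fixed(const struct vm_model *vms, size_t n, int nr_to_scan)
{
    for (size_t i = 0; i < n; i++) {
        if (!nr_to_scan--)
            break;
        if (!vms[i].n_used_mmu_pages)
            continue;
        return (int)i;
    }
    return -1;
}
```

With VMs holding 0, 5 and 0 pages, the buggy variant selects the first (empty) VM while the reversed variant selects the second one.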
2012-08-22 x86/alternatives: Fix p6 nops on non-modular kernels Avi Kivity
Probably a leftover from the early days of self-patching, p6nops are marked __initconst_or_module, which causes them to be discarded in a non-modular kernel. If something later triggers patching, it will overwrite kernel code with garbage. Reported-by: Tomas Racek <tracek@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Cc: Michael Tokarev <mjt@tls.msk.ru> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: qemu-devel@nongnu.org Cc: Anthony Liguori <anthony@codemonkey.ws> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-22 UBIFS: fix complaints about too small debug buffer size Artem Bityutskiy
When debugging is enabled, we use a temporary on-stack buffer for formatting the key strings like "(11368871, direntry, 0xcd0750)". The buffer size is 32 bytes and sometimes it is not enough to fit the key string - e.g., when inode numbers are high. This is not fatal, but the key strings are incomplete and UBIFS complains like this: UBIFS assert failed in dbg_snprintf_key at 137 (pid 1) This is a regression caused by "515315a UBIFS: fix key printing". Fix the issue by increasing the buffer to 48 bytes. Reported-by: Michael Hench <michaelhench@gmail.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Michael Hench <michaelhench@gmail.com> Cc: stable@vger.kernel.org [v3.3+]
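The buffer arithmetic behind this fix can be checked with snprintf(), which returns the length the full string would have needed. The sketch below is an illustration only: format_key_model() and its format string are assumptions chosen to reproduce the sample key string from the commit message, not the UBIFS debugging code.

```c
#include <stdio.h>
#include <stdbool.h>

/*
 * Format a key as "(inode, type, hash)" into buf.
 * Returns true only if the whole string, including the terminating NUL,
 * fitted into len bytes; on false the buffer holds a truncated string.
 */
bool format_key_model(char *buf, size_t len, unsigned long ino,
                      const char *type, unsigned int hash)
{
    int n = snprintf(buf, len, "(%lu, %s, %#x)", ino, type, hash);

    return n >= 0 && (size_t)n < len;
}
```

The sample string "(11368871, direntry, 0xcd0750)" is 30 characters and fits in 32 bytes, but a ten-digit inode number with a full 32-bit hash needs 34 characters plus the NUL, which only the 48-byte buffer can hold.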
2012-08-22 time: Avoid making adjustments if we haven't accumulated anything John Stultz
If update_wall_time() is called and the current offset isn't large enough to accumulate, avoid re-calling timekeeping_adjust which may change the clock freq and can cause 1ns inconsistencies with CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 time: Avoid potential shift overflow with large shift values John Stultz
Andreas Schwab noticed that the 1 << tk->shift could overflow if the shift value was greater than 30, since 1 would be a 32bit long on 32bit architectures. This issue was introduced by 1e75fa8be (time: Condense timekeeper.xtime into xtime_sec) Use 1ULL instead to ensure we don't overflow on the shift. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-22 time: Fix casting issue in timekeeping_forward_now Andreas Schwab
arch_gettimeoffset returns a u32 value which when shifted by tk->shift can overflow. This issue was introduced with 1e75fa8be (time: Condense timekeeper.xtime into xtime_sec) Cast it to u64 first. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
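Both shift fixes above come down to widening the operand before the shift rather than after it. A small sketch of the difference, with hypothetical helper names (shift_narrow(), shift_wide(), one_shifted()) that do not appear in the kernel:

```c
#include <stdint.h>

/* Shift carried out in 32 bits: the bits shifted past bit 31 are lost
 * before the result is widened to 64 bits. Valid for shift < 32. */
uint64_t shift_narrow(uint32_t offset, unsigned int shift)
{
    return (uint32_t)(offset << shift);
}

/* Cast to 64 bits first, as the timekeeping_forward_now fix does with
 * u64: the high bits survive. */
uint64_t shift_wide(uint32_t offset, unsigned int shift)
{
    return (uint64_t)offset << shift;
}

/* 1ULL keeps the constant 64 bits wide, so shifts above 30 are safe. */
uint64_t one_shifted(unsigned int shift)
{
    return 1ULL << shift;
}
```

For example, shifting 0x80000000 left by one yields 0 in the narrow version and 0x100000000 in the wide one.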
2012-08-22 time: Ensure we normalize the timekeeper in tk_xtime_add John Stultz
Andreas noticed problems with resume on specific hardware after commit 1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined with commit b44d50dca (time: Fix casting issue in tk_set_xtime and tk_xtime_add) After some digging I realized we aren't normalizing the timekeeper after the add. Add the missing normalize call. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Tested-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
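What "normalizing after the add" means can be sketched with a reduced timekeeper. The struct and function names below (tk_model, tk_model_normalize(), tk_model_add()) are hypothetical, and the layout is an assumption: seconds in one field and shifted nanoseconds in another, as the referenced "Condense timekeeper.xtime into xtime_sec" title suggests.

```c
#include <stdint.h>

#define NSEC_PER_SEC_MODEL 1000000000ULL

/* Hypothetical reduced timekeeper: whole seconds plus nanoseconds that
 * are stored left-shifted by `shift`. */
struct tk_model {
    uint64_t xtime_sec;
    uint64_t xtime_nsec;    /* nanoseconds << shift */
    unsigned int shift;
};

/* Carry whole seconds out of the shifted-nanosecond field. */
void tk_model_normalize(struct tk_model *tk)
{
    uint64_t one_sec = NSEC_PER_SEC_MODEL << tk->shift;

    while (tk->xtime_nsec >= one_sec) {
        tk->xtime_nsec -= one_sec;
        tk->xtime_sec++;
    }
}

/* Add a (sec, nsec) pair; without the final normalize call the
 * nanosecond field could be left holding more than one second. */
void tk_model_add(struct tk_model *tk, uint64_t sec, uint64_t nsec)
{
    tk->xtime_sec += sec;
    tk->xtime_nsec += nsec << tk->shift;
    tk_model_normalize(tk);
}
```

Adding 1.2 s to 10.9 s with shift 8 must end at 12 s plus 100000000 ns; skipping the normalize step would leave 11 s with 1.1 s worth of shifted nanoseconds.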
2012-08-22 x86/fixup_irq: Use cpu_online_mask instead of cpu_all_mask Liu, Chuansheng
When one CPU is going down and this CPU is the last one in irq affinity, current code is setting cpu_all_mask as the new affinity for that irq. But for some systems (such as in Medfield Android mobile) the firmware sends the interrupt to each CPU in the irq affinity mask, averaged, and cpu_all_mask includes all potential CPUs, i.e. offline ones as well. So replace cpu_all_mask with cpu_online_mask. Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
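The fallback described here can be expressed with plain bitmasks. This is a sketch under assumptions: CPUs are modelled as bits in a 64-bit word, and fixup_affinity_model() is a hypothetical helper, not the x86 fixup_irqs() code.

```c
#include <stdint.h>

/* Hypothetical CPU mask: bit n set means CPU n is in the mask. */
typedef uint64_t cpumask_model;

/*
 * If the IRQ's affinity no longer contains any online CPU, fall back to
 * the set of online CPUs rather than to every possible CPU, so that the
 * new mask never names a CPU that is offline.
 */
cpumask_model fixup_affinity_model(cpumask_model affinity,
                                   cpumask_model online)
{
    if ((affinity & online) == 0)
        return online;
    return affinity;
}
```

With CPUs 0 and 1 online and an IRQ bound only to the departing CPU 2, the result is the mask of CPUs 0 and 1.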
2012-08-22 x86/spinlocks: Fix comment in spinlock.h Richard Weinberger
This comment is no longer true. We support up to 2^16 CPUs because __ticket_t is a u16 if NR_CPUS is larger than 256. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-08-22 fbcon: fix race condition between console lock and cursor timer (v1.1) Dave Airlie
So we've had a fair few reports of fbcon handover breakage between efi/vesafb and i915 surface recently, so I dedicated a couple of days to finding the problem. Essentially the last thing we saw was the conflicting framebuffer message and that was all. So after much tracing with direct netconsole writes (printks under console_lock not so useful), I think I found the race.
  Thread A (driver load)        Thread B (timer thread)
    unbind_con_driver ->              |
    bind_con_driver ->                |
    vc->vc_sw->con_deinit ->          |
    fbcon_deinit ->                   |
    console_lock()                    |
        |                             |
        |                       fbcon_flashcursor timer fires
        |                       console_lock() <- blocked for A
        |
        |
    fbcon_del_cursor_timer ->
        del_timer_sync
        (BOOM)
Of course because all of this is under the console lock, we never see anything, also since we also just unbound the active console guess what we never see anything. Hopefully this fixes the problem for anyone seeing vesafb->kms driver handoff.
v1.1: add comment suggestion from Alan.
Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-08-22 Merge branch 'akpm' (Andrew's patch-bomb) Linus Torvalds
Merge fixes from Andrew Morton. Random drivers and some VM fixes.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (17 commits)
  mm: compaction: Abort async compaction if locks are contended or taking too long
  mm: have order > 0 compaction start near a pageblock with free pages
  rapidio/tsi721: fix unused variable compiler warning
  rapidio/tsi721: fix inbound doorbell interrupt handling
  drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode
  mm: correct page->pfmemalloc to fix deactivate_slab regression
  drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes
  mm/compaction.c: fix deferring compaction mistake
  drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources
  string: do not export memweight() to userspace
  hugetlb: update hugetlbpage.txt
  checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO
  mm: hugetlbfs: correctly populate shared pmd
  cciss: fix incorrect scsi status reporting
  Documentation: update mount option in filesystem/vfat.txt
  mm: change nr_ptes BUG_ON to WARN_ON
  cs5535-clockevt: typo, it's MFGPT, not MFPGT
2012-08-21 Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Linus Torvalds
Pull media fixes from Mauro Carvalho Chehab: "For bug fixes, at soc_camera, si470x, uvcvideo, iguanaworks IR driver, radio_shark Kbuild fixes, and at the V4L2 core (radio fixes)."
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] media: soc_camera: don't clear pix->sizeimage in JPEG mode
  [media] media: mx2_camera: Fix clock handling for i.MX27
  [media] video: mx2_camera: Use clk_prepare_enable/clk_disable_unprepare
  [media] video: mx1_camera: Use clk_prepare_enable/clk_disable_unprepare
  [media] media: mx3_camera: buf_init() add buffer state check
  [media] radio-shark2: Only compile led support when CONFIG_LED_CLASS is set
  [media] radio-shark: Only compile led support when CONFIG_LED_CLASS is set
  [media] radio-shark*: Call cancel_work_sync from disconnect rather then release
  [media] radio-shark*: Remove work-around for dangling pointer in usb intfdata
  [media] Add USB dependency for IguanaWorks USB IR Transceiver
  [media] Add missing logging for rangelow/high of hwseek
  [media] VIDIOC_ENUM_FREQ_BANDS fix
  [media] mem2mem_testdev: fix querycap regression
  [media] si470x: v4l2-compliance fixes
  [media] DocBook: Remove a spurious character
  [media] uvcvideo: Reset the bytesused field when recycling an erroneous buffer
2012-08-21 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Linus Torvalds
Pull networking update from David Miller: "A couple weeks of bug fixing in there. The largest chunk is all the broken crap Amerigo Wang found in the netpoll layer."
 1) netpoll and its users have several serious bugs:
    a) uses GFP_KERNEL with locks held
    b) interfaces requiring interrupts disabled are called with them enabled
    c) and vice versa
    d) VLAN tag demuxing, as per all other RX packet input paths, is not applied
    All from Amerigo Wang.
 2) Hopefully cure the ipv4 mapped ipv6 address TCP early demux bugs for good, from Neal Cardwell.
 3) Unlike AF_UNIX, AF_PACKET sockets don't set default credentials when the user doesn't specify one explicitly during sendmsg(). Instead we attach an empty (zero) SCM credential block which is definitely not what we want. Fix from Eric Dumazet.
 4) IPv6 illegally invokes netdevice notifiers with RCU lock held, fix from Ben Hutchings.
 5) inet_csk_route_child_sock() checks wrong inet options pointer, fix from Christoph Paasch.
 6) When AF_PACKET is used for transmit, packet loopback doesn't behave properly when a socket fanout is enabled, from Eric Leblond.
 7) On bluetooth l2cap channel create failure, we leak the socket, from Jaganath Kanakkassery.
 8) Fix all the netprio file handling bugs found by Al Viro, from John Fastabend.
 9) Several error return and NULL deref bug fixes in networking drivers from Julia Lawall.
10) A large smattering of struct padding et al. kernel memory leaks to userspace found by Mathias Krause.
11) Conntrack expectations in netfilter can access an uninitialized timer, fix from Pablo Neira Ayuso.
12) Several netfilter SIP tracker bug fixes from Patrick McHardy.
13) IPSEC ipv6 routes are not initialized correctly all the time, resulting in an OOPS in inet_putpeer(). Also from Patrick McHardy.
14) Bridging does rcu_dereference() outside of RCU protected area, from Stephen Hemminger.
15) Fix routing cache removal performance regression when looking up output routes that have a local destination. From Zheng Yan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  af_netlink: force credentials passing [CVE-2012-3520]
  ipv4: fix ip header ident selection in __ip_make_skb()
  ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()
  tcp: fix possible socket refcount problem
  net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
  net/core/dev.c: fix kernel-doc warning
  netconsole: remove a redundant netconsole_target_put()
  net: ipv6: fix oops in inet_putpeer()
  net/stmmac: fix issue of clk_get for Loongson1B.
  caif: Do not dereference NULL in chnl_recv_cb()
  af_packet: don't emit packet on orig fanout group
  drivers/net/irda: fix error return code
  drivers/net/wan/dscc4.c: fix error return code
  drivers/net/wimax/i2400m/fw.c: fix error return code
  smsc75xx: add missing entry to MAINTAINERS
  net: qmi_wwan: new devices: UML290 and K5006-Z
  net: sh_eth: Add eth support for R8A7779 device
  netdev/phy: skip disabled mdio-mux nodes
  dt: introduce for_each_available_child_of_node, of_get_next_available_child
  net: netprio: fix cgrp create and write priomap race
  ...
2012-08-21 mm: compaction: Abort async compaction if locks are contended or taking too long Mel Gorman
Jim Schutt reported a problem that pointed at compaction contending heavily on locks. The workload is straightforward and, in his own words:
  The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers.
Early in the test everything looks fine
  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r  b swpd   free buff    cache si so   bi      bo     in     cs us sy id wa st
   31 15    0 287216  576 38606628  0  0    2    1158      2     14  1  3 95  0  0
   27 15    0 225288  576 38583384  0  0   18 2222016 203357 134876 11 56 17 15  0
   28 17    0 219256  576 38544736  0  0   11 2305932 203141 146296 11 49 23 17  0
    6 18    0 215596  576 38552872  0  0    7 2363207 215264 166502 12 45 22 20  0
   22 18    0 226984  576 38596404  0  0    3 2445741 223114 179527 12 43 23 22  0
and then it goes to pot
  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r  b swpd   free buff    cache si so   bi      bo     in     cs us sy id wa st
  163  8    0 464308  576 36791368  0  0   11   22210    866    536  3 13 79  4  0
  207 14    0 917752  576 36181928  0  0  712 1345376 134598  47367  7 90  1  2  0
  123 12    0 685516  576 36296148  0  0  429 1386615 158494  60077  8 84  5  3  0
  123 12    0 598572  576 36333728  0  0 1107 1233281 147542  62351  7 84  5  4  0
  622  7    0 660768  576 36118264  0  0  557 1345548 151394  59353  7 85  4  3  0
  223 11    0 283960  576 36463868  0  0   46 1107160 121846  33006  6 93  1  1  0
Note that system CPU usage is very high and that blocks being written out have dropped by 42%.
He analysed this with perf and found
  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]
There was other data but primarily it is all showing that compaction is contended heavily on the zone->lock and zone->lru_lock.
commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled while isolating pages for migration] noted that it was possible for migration to hold the lru_lock for an excessive amount of time. Very broadly speaking this patch expands the concept.
This patch introduces compact_checklock_irqsave() to check if a lock is contended or the process needs to be scheduled. If either condition is true then async compaction is aborted and the caller is informed. The page allocator will fail a THP allocation if compaction failed due to contention. This patch also introduces compact_trylock_irqsave() which will acquire the lock only if it is not contended and the process does not need to schedule.
Reported-by: Jim Schutt <jaschut@sandia.gov> Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
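The "give up instead of spinning" idea in this commit message can be sketched in userspace. The code below is a reduced model, not the patch: the lock is a C11 atomic_flag rather than a kernel spinlock, the need-to-reschedule check is left out, and compact_model, model_trylock() and model_unlock() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reduced compaction control: only the two fields needed. */
struct compact_model {
    bool sync;          /* synchronous callers are allowed to wait */
    bool contended;     /* set when an async caller backed off */
};

/*
 * Try to take the lock. An asynchronous caller that finds it held
 * records the contention and aborts; a synchronous caller keeps
 * retrying until the lock is free.
 * Returns true when the lock has been acquired.
 */
bool model_trylock(atomic_flag *lock, struct compact_model *cc)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        if (!cc->sync) {
            cc->contended = true;
            return false;
        }
    }
    return true;
}

/* Release a lock previously taken with model_trylock(). */
void model_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A caller that sees false can report the contention upwards, which corresponds to the allocator failing the THP allocation rather than piling more waiters onto the lock.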
2012-08-21 mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it left") introduced a caching mechanism to reduce the amount of work the free page scanner does in compaction. However, it has a problem. Consider two processes simultaneously scanning free pages
                                    C
  Process A             M     S     F
                |---------------------------------------|
  Process B             M       FS

  C is zone->compact_cached_free_pfn
  S is cc->start_free_pfn
  M is cc->migrate_pfn
  F is cc->free_pfn
In this diagram, Process A has just reached its migrate scanner, wrapped around and updated compact_cached_free_pfn accordingly. Simultaneously, Process B finishes isolating in a block and updates compact_cached_free_pfn again to the location of its free scanner. Process A moves to "end_of_zone - one_pageblock" and runs this check
	if (cc->order > 0 && (!cc->wrapped ||
			      zone->compact_cached_free_pfn >
			      cc->start_free_pfn))
		pfn = min(pfn, zone->compact_cached_free_pfn);
compact_cached_free_pfn is above where it started so the free scanner skips almost the entire space it should have scanned. When there are multiple processes compacting it can end in a situation where the entire zone is not being scanned at all. Further, it is possible for two processes to ping-pong updates to compact_cached_free_pfn, which is just random. Overall, the end result wrecks allocation success rates.
There is not an obvious way around this problem without introducing new locking and state so this patch takes a different approach. First, it gets rid of the skip logic because it's not clear that it matters if two free scanners happen to be in the same block but with racing updates it's too easy for it to skip over blocks it should not. Second, it updates compact_cached_free_pfn in a more limited set of circumstances. If a scanner has wrapped, it updates compact_cached_free_pfn to the end of the zone. When a wrapped scanner isolates a page, it updates compact_cached_free_pfn to point to the highest pageblock it can isolate pages from.
If a scanner has not wrapped when it has finished isolating pages, it checks if compact_cached_free_pfn is pointing to the end of the zone. If so, the value is updated to point to the highest pageblock that pages were isolated from. This value will not be updated again until a free page scanner wraps and resets compact_cached_free_pfn. This is not optimal and it can still race but the compact_cached_free_pfn will be pointing to or very near a pageblock with free pages. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 rapidio/tsi721: fix unused variable compiler warning Alexandre Bounine
Fix unused variable compiler warning when built with CONFIG_RAPIDIO_DEBUG option off. This patch is applicable to kernel versions starting from v3.2 Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 rapidio/tsi721: fix inbound doorbell interrupt handling Alexandre Bounine
Make sure that there are no doorbell messages left behind due to disabled interrupts during inbound doorbell processing. The most common case for this bug is loss of rionet JOIN messages in systems with three or more rionet participants and MSI or MSI-X enabled. As a result, requests for packet transfers may finish with a "destination unreachable" error message. This patch is applicable to kernel versions starting from v3.2. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode Atsushi Nemoto
Correct the offset by subtracting 20 from tm_hour before taking the modulo 12. [ "Why 20?" I hear you ask. Or at least I did. Here's the reason why: RS5C348_BIT_PM is 32, and is - stupidly - included in the RS5C348_HOURS_MASK define. So it's really subtracting out that bit to get "hour+12". But then because it does things modulo 12, it needs to add the 12 in again afterwards anyway. This code is confused. It would be much clearer if RS5C348_HOURS_MASK just didn't include the RS5C348_BIT_PM bit at all, then it wouldn't need to do the silly subtract either. Whatever. It's all just math, the end result is the same. - Linus ] Reported-by: James Nute <newten82@gmail.com> Tested-by: James Nute <newten82@gmail.com> Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
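The arithmetic in this message can be worked through with a small decoder. This is a sketch under loudly stated assumptions: the hours register is taken to hold BCD hours 01 to 12 with the PM flag at 0x20 (the value 32 quoted above), the mask value 0x3f and the names HOURS_MASK_MODEL, BIT_PM_MODEL, bcd2bin_model() and decode_hour_12h() are made up for illustration, and none of it is copied from the driver. Under these assumptions the PM flag lands in the BCD tens digit, so it shows up as 20 in the decoded value, which is what the subtraction removes.

```c
/* Assumed register layout; see the hedges in the text above. */
#define BIT_PM_MODEL     0x20   /* 32, as quoted in the commit message */
#define HOURS_MASK_MODEL 0x3f   /* assumption: a mask that includes the PM bit */

/* Convert one packed BCD byte (two decimal digits) to binary. */
unsigned int bcd2bin_model(unsigned char v)
{
    return (v & 0x0fu) + (v >> 4) * 10u;
}

/* Decode a 12-hour-mode hours register into a 0..23 hour. */
int decode_hour_12h(unsigned char reg)
{
    int hour = (int)bcd2bin_model(reg & HOURS_MASK_MODEL);

    if (reg & BIT_PM_MODEL) {
        hour -= 20;     /* remove the PM flag that leaked through the mask */
        hour %= 12;     /* 12 PM becomes 0 before the offset is added */
        hour += 12;
    } else {
        hour %= 12;     /* 12 AM is midnight */
    }
    return hour;
}
```

For instance, 0x27 (7 PM) decodes to 27, then 7, then 19, while 0x32 (12 PM) decodes to 32, then 12, then 0, then 12.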
2012-08-21 mm: correct page->pfmemalloc to fix deactivate_slab regression Alex Shi
Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc setting, but it missed some places where pfmemalloc should be set. So, in __slab_alloc, the mismatch between pfmemalloc and ALLOC_NO_WATERMARKS causes incorrect deactivate_slab() calls on our core2 server:
    64.73%  fio  [kernel.kallsyms]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |---0.34%-- deactivate_slab
               |          __slab_alloc
               |          kmem_cache_alloc
               |          |
That causes our fio sync write performance to have a 40% regression. Move the check into get_page_from_freelist(), which resolves this issue.
Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes Ilya Shchepetkov
Dynamically allocated sysfs attributes must be initialized using sysfs_attr_init(), otherwise lockdep complains: BUG: key <address> not in .data! Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Ilya Shchepetkov <shchepetkov@ispras.ru> Cc: Chris Verges <chrisv@cyberswitching.com> Cc: Christian Pellegrin <chripell@fsfe.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 mm/compaction.c: fix deferring compaction mistake Minchan Kim
Commit aff622495c9a ("vmscan: only defer compaction for failed order and higher") fixed the bad deferring policy but made a mistake when checking compact_order_failed in __compact_pgdat(), so compact_order_failed cannot be updated with the new order. This ends up preventing the deferral policy from operating correctly. This patch fixes it. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources Robin Holt
On many of our larger systems, CPU 0 has had all of its IRQ resources consumed before XPC loads. Worst cases on machines with multiple 10 GigE cards and multiple IB cards have depleted the entire first socket of IRQs. This patch makes selecting the node upon which IRQs are allocated (as well as all the other GRU Message Queue structures) specifiable as a module load param and has a default behavior of searching all nodes/cpus for available resources. [akpm@linux-foundation.org: fix build: include cpu.h and module.h] Signed-off-by: Robin Holt <holt@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 string: do not export memweight() to userspace WANG Cong
Fix the following warning: usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 hugetlb: update hugetlbpage.txt Zhouping Liu
Commit f0f57b2b1488 ("mm: move hugepage test examples to tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it didn't update hugetlbpage.txt Signed-off-by: Zhouping Liu <sanweidaying@gmail.com> Acked-by: Dave Young <dyoung@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO Joe Perches
Commit b13edf7ff2dd ("checkpatch: add checks for do {} while (0) macro misuses") added a test that is overly simplistic for single statement macros. Macros that start with control tests should be enclosed in a do {} while (0) loop. Add the necessary control tests to the check. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Whitcroft <apw@canonical.com> Tested-by: Franz Schrober <franzschrober@yahoo.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
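Why a macro that starts with a control test wants the do {} while (0) wrapper is easiest to see with a dangling else. The macros and functions below (NOTE_IF_UNSAFE, NOTE_IF_SAFE, note(), classify_unsafe(), classify_safe()) are hypothetical examples, not taken from checkpatch or the kernel, and a compiler may warn about the unsafe variant, which is the point.

```c
int note_calls;                     /* counts how often note() ran */

void note(void)
{
    note_calls++;
}

/* Starts with a control test and is NOT wrapped: an else that follows
 * the macro call attaches to the macro's own if. */
#define NOTE_IF_UNSAFE(cond)    if (cond) note()

/* Wrapped form: expands to exactly one statement. */
#define NOTE_IF_SAFE(cond)      do { if (cond) note(); } while (0)

/* Intended meaning: return 1 only when x is zero. */
int classify_unsafe(int x, int y)
{
    int r = 0;

    if (x)
        NOTE_IF_UNSAFE(y);
    else                            /* silently binds to "if (y)" */
        r = 1;
    return r;
}

int classify_safe(int x, int y)
{
    int r = 0;

    if (x)
        NOTE_IF_SAFE(y);
    else                            /* binds to "if (x)" as intended */
        r = 1;
    return r;
}
```

With x set and y clear the unsafe version returns 1 although x is non-zero, and with x clear it returns 0; the wrapped version gives the intended answers in both cases.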
2012-08-21 mm: hugetlbfs: correctly populate shared pmd Michal Hocko
Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straightforward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman:
  kernel BUG at mm/filemap.c:135!
  invalid opcode: 0000 [#1] SMP
  CPU 22
  Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
  Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
  RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
  Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
  Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240
During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram:
  parent
    |
    ------------>pmd
                 src_pte----------> data page
                                        ^
  other--------->pmd--------------------|
                  ^
  child-----------|
                 dst_pte
For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race.
  PROC A                                PROC B
  copy_hugetlb_page_range               copy_hugetlb_page_range
    src_pte == huge_pte_offset            src_pte == huge_pte_offset
    !src_pte so no sharing                !src_pte so no sharing

  (time passes)

  hugetlb_fault                         hugetlb_fault
    huge_pte_alloc                        huge_pte_alloc
      huge_pmd_share                        huge_pmd_share
        LOCK(i_mmap_mutex)
        find nothing, no sharing
        UNLOCK(i_mmap_mutex)
                                              LOCK(i_mmap_mutex)
                                              find nothing, no sharing
                                              UNLOCK(i_mmap_mutex)
      pmd_alloc                             pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                            LOCK(instantiation_mutex)
                                            fault
                                            UNLOCK(instantiation_mutex)
These two processes are now pointing to the same data page but are not sharing page tables because the opportunity was missed. When either process later forks, the src_pte == dst_pte check is potentially insufficient. As the check falls through, the wrong PTE information is copied in (harmless but wrong) and the mapcount is bumped for a page mapped by a shared page table, leading to the BUG_ON.
This patch addresses the issue by moving pmd_alloc into huge_pmd_share, which guarantees that the shared pud is populated in the same critical section as pmd. This also means that the huge_pte_offset test in huge_pmd_share is serialized correctly now, which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together.
Race identified and changelog written mostly by Mel Gorman.
[akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 cciss: fix incorrect scsi status reporting Stephen M. Cameron
Delete code which sets SCSI status incorrectly as it's already been set correctly above this incorrect code. The bug was introduced in 2009 by commit b0e15f6db111 ("cciss: fix typo that causes scsi status to be lost.") Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Reported-by: Roel van Meer <roel.vanmeer@bokxing.nl> Tested-by: Roel van Meer <roel.vanmeer@bokxing.nl> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21Documentation: update mount option in filesystem/vfat.txtNamjae Jeon
Update two mount options (discard, nfs) in vfat.txt. Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21mm: change nr_ptes BUG_ON to WARN_ONHugh Dickins
Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating that not all the page tables allocated could be found and freed when exit_mmap() tore down the user address space. There's usually nothing we can say about it, beyond that it's probably a sign of some bad memory or memory corruption; though it might still indicate a bug in vma or page table management (and did recently reveal a race in THP, fixed a few months ago). But one overdue change we can make is from BUG_ON to WARN_ON. It's fairly likely that the system will crash shortly afterwards in some other way (for example, the BUG_ON(page_mapped(page)) in __delete_from_page_cache(), once an inode mapped into the lost page tables gets evicted); but might tell us more before that. Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less eager, since that one has several times led to fixes. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21cs5535-clockevt: typo, it's MFGPT, not MFPGTJens Rottmann
Signed-off-by: Jens Rottmann <JRottmann@LiPPERTEmbedded.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21drm: Add missing static storage class specifiers in drm_proc.c fileSachin Kamat
Fixes the following sparse warning: drivers/gpu/drm/drm_proc.c:92:5: warning: symbol 'drm_proc_create_files' was not declared. Should it be static? drivers/gpu/drm/drm_proc.c:175:5: warning: symbol 'drm_proc_remove_files' was not declared. Should it be static? Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-08-21drm/udl: dpms off the crtc when disabled.Dave Airlie
This turns off the crtc when it's been disabled, which fixes it not turning off properly the whole time. Signed-off-by: Dave Airlie <airlied@redhat.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
2012-08-21drm: Remove two unused fields from struct drm_display_modeDamien Lespiau
Signed-off-by: Damien Lespiau <damien.lespiau@intel.com> Reviewed-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-08-21drm: stop vmgfx driver explosionAlan Cox
If you do a page flip with no flags set then event is NULL. If event is NULL then the vmw_gfx driver likes to go digging into NULL and extracts NULL->base.file_priv. On a modern kernel with NULL mapping protection it's just another oops; without it there are some "intriguing" possibilities. What it should do is an open question, but that is for the driver owners to sort out. Signed-off-by: Alan Cox <alan@linux.intel.com> Reviewed-by: Jakob Bornecrantz <jakob@vmware.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
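A minimal sketch of the kind of guard described here, assuming made-up types: `pending_event`, `file_priv` and `flip_event_owner()` are hypothetical and are not the vmwgfx structures. It only shows checking the event pointer before reaching through it; what the driver should do in the no-event case is left open, as the message says.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's event bookkeeping. */
struct file_priv {
    int id;
};

struct pending_event {
    struct {
        struct file_priv *file_priv;
    } base;
};

/*
 * Return the file that should be notified about the flip, or NULL when
 * the caller did not request an event (flags == 0 in the description).
 */
struct file_priv *flip_event_owner(const struct pending_event *event)
{
    if (event == NULL)
        return NULL;            /* never evaluate NULL->base.file_priv */
    return event->base.file_priv;
}
```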
2012-08-21Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixesDave Airlie
Daniel writes: " Nothing too major: - A few fixes around the edid handling from Jani, also fixing a regression in 3.5 due to us using gmbus by default. - Fixup hsw uncached pte flags. - Fix suspend/resume crash when using hw contexts, from Ben. - Try to tune gpu turbo a bit better, seems to help with some oddball power regressions." * 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel: drm/i915: use hsw rps tuning values everywhere on gen6+ drm/i915: fall back to bit-banging if GMBUS fails in CRT EDID reads drm/i915: extract connector update from intel_ddc_get_modes() for reuse drm/i915: fix hsw uncached pte drm/i915/contexts: fix list corruption drm/i915: fix EDID memory leak in SDVO
2012-08-21Merge branch 'drm-fixes-3.6' of git://people.freedesktop.org/~agd5f/linux into drm-fixesDave Airlie
Alex writes: "This is the current set of radeon fixes for 3.6. Nothing too major. Highlights: - fix vbios fetch on pure uefi systems - fix vbios fetch on thunderbolt systems - MSAA fixes - lockup timeout fix - modesetting fix" * 'drm-fixes-3.6' of git://people.freedesktop.org/~agd5f/linux: drm/radeon/ss: use num_crtc rather than hardcoded 6 Revert "drm/radeon: fix bo creation retry path" drm/radeon: split ATRM support out from the ATPX handler (v3) drm/radeon: convert radeon vfct code to use acpi_get_table_with_size ACPI: export symbol acpi_get_table_with_size drm/radeon: implement ACPI VFCT vbios fetch (v3) drm/radeon/kms: extend the Fujitsu D3003-S2 board connector quirk to cover later silicon stepping drm/radeon: fix checking of MSAA renderbuffers on r600-r700 drm/radeon: allow CMASK and FMASK in the CS checker on r600-r700 drm/radeon: init lockup timeout on ring init drm/radeon: avoid turning off spread spectrum for used pll
2012-08-21ceph: avoid divide by zero in __validate_layout()Sage Weil
If "l->stripe_unit" is zero then the mod on the next line will cause a divide by zero bug. This comes from the copy_from_user() in ceph_ioctl_set_layout_policy(). Passing 0 is valid, though (it means "do not change") so avoid the % check in that case. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
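The shape of the check can be sketched as follows. This is an illustration only: `layout_sizes_ok()` is a hypothetical helper, not the body of __validate_layout(), and it models just the stripe_unit test discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: a stripe_unit of 0 means "do not change", so the
 * modulo is skipped entirely; otherwise object_size must be a multiple
 * of stripe_unit.
 */
bool layout_sizes_ok(uint32_t object_size, uint32_t stripe_unit)
{
    if (stripe_unit == 0)
        return true;            /* nothing to divide by, nothing to check */
    return object_size % stripe_unit == 0;
}
```

Testing for zero before the `%` is what keeps the user-supplied value from reaching the division.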
2012-08-21libceph: avoid truncation due to racing bannersJim Schutt
Because the Ceph client messenger uses a non-blocking connect, it is possible for the sending of the client banner to race with the arrival of the banner sent by the peer. When ceph_sock_state_change() notices the connect has completed, it schedules work to process the socket via con_work(). During this time the peer is writing its banner, and arrival of the peer banner races with con_work(). If con_work() calls try_read() before the peer banner arrives, there is nothing for it to do, after which con_work() calls try_write() to send the client's banner. In this case Ceph's protocol negotiation can complete successfully. The server-side messenger immediately sends its banner and addresses after accepting a connect request, *before* actually attempting to read or verify the banner from the client. As a result, it is possible for the banner from the server to arrive before con_work() calls try_read(). If that happens, try_read() will read the banner and prepare protocol negotiation info via prepare_write_connect(). prepare_write_connect() calls con_out_kvec_reset(), which discards the as-yet-unsent client banner. Next, con_work() calls try_write(), which sends the protocol negotiation info rather than the banner that the peer is expecting. The result is that the peer sees an invalid banner, and the client reports "negotiation failed". Fix this by moving con_out_kvec_reset() out of prepare_write_connect() to its callers at all locations except the one where the banner might still need to be sent. [elder@inktank.com: added note about server-side behavior] Signed-off-by: Jim Schutt <jaschut@sandia.gov> Reviewed-by: Alex Elder <elder@inktank.com>
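The effect of resetting the outgoing queue inside the prepare step can be shown with a toy model. Everything here is hypothetical (`out_queue`, `out_add()`, `prepare_connect()` and friends are not the libceph API); the sketch only contrasts a helper that resets before appending with one that leaves the reset to its callers.

```c
#include <assert.h>
#include <stddef.h>

enum out_item { OUT_BANNER, OUT_CONNECT_INFO };

#define OUT_MAX 4

/* Toy model of a connection's pending output. */
struct out_queue {
    enum out_item item[OUT_MAX];
    size_t n;
};

void out_reset(struct out_queue *q)
{
    assert(q != NULL);
    q->n = 0;
}

void out_add(struct out_queue *q, enum out_item it)
{
    assert(q != NULL);
    if (q->n < OUT_MAX)
        q->item[q->n++] = it;
}

/* Pre-fix shape: the reset inside the helper drops an unsent banner. */
void prepare_connect_resetting(struct out_queue *q)
{
    out_reset(q);
    out_add(q, OUT_CONNECT_INFO);
}

/* Post-fix shape: only append; each caller resets where that is safe. */
void prepare_connect(struct out_queue *q)
{
    out_add(q, OUT_CONNECT_INFO);
}
```

With the second form, a banner queued before the peer's banner arrives is still at the head of the queue when the write path runs.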
2012-08-21ceph: tolerate (and warn on) extraneous dentry from mdsSage Weil
If the MDS gives us a dentry and we weren't prepared to handle it, WARN_ON_ONCE instead of crashing. Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2012-08-21drm/radeon/ss: use num_crtc rather than hardcoded 6Alex Deucher
When checking if a pll is in use. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2012-08-21ide: fix generic_ide_suspend/resume OopsMiklos Szeredi
This patch fixes a regression introduced by commit 0998d063 (device-core: Ensure drvdata = NULL when no driver is bound). Suspend oopses in generic_ide_suspend() because dev_get_drvdata() returns NULL (dev->p->driver_data == NULL) and this function is not prepared for this. Fix is based on Alan Stern's suggestion. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21af_netlink: force credentials passing [CVE-2012-3520]Eric Dumazet
Pablo Neira Ayuso discovered that avahi and potentially NetworkManager accept spoofed Netlink messages because of a kernel bug. The kernel passes all-zero SCM_CREDENTIALS ancillary data to the receiver if the sender did not provide such data, instead of not including any such data at all or including the correct data from the peer (as it is the case with AF_UNIX). This bug was introduced in commit 16e572626961 (af_unix: dont send SCM_CREDENTIALS by default) This patch forces passing credentials for netlink, as before the regression. Another fix would be to not add SCM_CREDENTIALS in netlink messages if not provided by the sender, but it might break some programs. With help from Florian Weimer & Petr Matousek This issue is designated as CVE-2012-3520 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21ipv4: fix ip header ident selection in __ip_make_skb()Eric Dumazet
Christian Casteyde reported a kmemcheck 32-bit read from uninitialized memory in __ip_select_ident(). It turns out that __ip_make_skb() called ip_select_ident() before properly initializing iph->daddr. This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of cached route inetpeer.) Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()Christoph Paasch
Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and TCP."), inet_csk_route_child_sock() is called instead of inet_csk_route_req(). However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(), ireq->opt is set to NULL, before calling inet_csk_route_child_sock(). Thus, inside inet_csk_route_child_sock() opt is always NULL and the SRR-options are not respected anymore. Packets sent by the server won't have the correct destination-IP. This patch fixes it by accessing newinet->inet_opt instead of ireq->opt inside inet_csk_route_child_sock(). Reported-by: Luca Boccassi <luca.boccassi@gmail.com> Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21tcp: fix possible socket refcount problemEric Dumazet
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events) added bug leading to following trace : [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.131726] [ 2866.132188] ========================= [ 2866.132281] [ BUG: held lock freed! ] [ 2866.132281] 3.6.0-rc1+ #622 Not tainted [ 2866.132281] ------------------------- [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there! [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] 4 locks held by kworker/0:1/652: [ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f [ 2866.132281] [ 2866.132281] stack backtrace: [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622 [ 2866.132281] Call Trace: [ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159 [ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a [ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e [ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56 [ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f [ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f [ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85 [ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148 [ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9 [ 2866.132281] [<ffffffff81a0a00c>] ? 
_raw_spin_unlock+0x29/0x2e [ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30 [ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6 [ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad [ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4 [ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f [ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1 [ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56 [ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9 [ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26 [ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4 [ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70 [ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4 [ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1 [ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a [ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6 [ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5 [ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4 [ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5 [ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f [ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43 [ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80 [ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0 [ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61 [ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1 [ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db [ 2866.132281] [<ffffffff81999bcd>] ? 
call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e [ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225 [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34 [ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f [ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f [ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225 [ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317 [ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f [ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2 [ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10 [ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13 [ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a [ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13 [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.309689] ============================================================================= [ 2866.310254] BUG TCP (Not tainted): Object already free [ 2866.310254] ----------------------------------------------------------------------------- [ 2866.310254]

The bug comes from the fact that the timer set in sk_reset_timer() can run before we actually do the sock_hold(). The socket refcount reaches zero and we free the socket too soon. The timer handler is not allowed to reduce the socket refcnt if the socket is owned by the user, or we need to change the sk_reset_timer() implementation.

We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED or TCP_DELACK_TIMER_DEFERRED bits are set in tsq_flags.

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use the same socket refcount change for TCP_MTU_REDUCED_DEFERRED, even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
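The reference-counting rule described above (a deferred timer event must hold its own reference until the deferred work has run) can be sketched in plain C. This is a simplified, single-threaded model under stated assumptions: `sock_sketch`, `write_timer()` and `release_cb()` are hypothetical names, plain integers replace the kernel's atomic refcount and bit operations, and the real patch may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    WRITE_TIMER_DEFERRED  = 1u << 0,
    DELACK_TIMER_DEFERRED = 1u << 1,
};

/* Hypothetical, non-atomic model of the socket state involved. */
struct sock_sketch {
    int refcnt;
    unsigned int flags;     /* plays the role of tsq_flags */
    bool owned_by_user;
    int handled;            /* how many times timer work actually ran */
};

void sketch_hold(struct sock_sketch *sk) { sk->refcnt++; }
void sketch_put(struct sock_sketch *sk)  { assert(sk->refcnt > 0); sk->refcnt--; }

/* Timer callback: enters holding the timer's reference, drops it on exit. */
void write_timer(struct sock_sketch *sk)
{
    if (!sk->owned_by_user) {
        sk->handled++;                      /* run the work right away */
    } else if (!(sk->flags & WRITE_TIMER_DEFERRED)) {
        sk->flags |= WRITE_TIMER_DEFERRED;
        sketch_hold(sk);                    /* reference owned by the deferred work */
    }
    sketch_put(sk);                         /* the timer's own reference */
}

/* Runs when the owner releases the socket: do deferred work, drop its ref. */
void release_cb(struct sock_sketch *sk)
{
    if (sk->flags & WRITE_TIMER_DEFERRED) {
        sk->flags &= ~WRITE_TIMER_DEFERRED;
        sk->handled++;
        sketch_put(sk);
    }
}
```

Because the deferral takes its own reference, the timer dropping its reference while the socket is owned by the user cannot bring the count to zero in this model.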