Age | Commit message (Collapse) | Author |
|
We drop the con mutex when delivering a message. When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
Revoke all mon_client messages when we shut down the old connection.
This is mostly moot since we are re-using the same ceph_connection,
but it is cleaner.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
There are many (normal) conditions that can lead to us getting
unexpected replies, include cluster topology changes, osd failures,
and timeouts. There's no need to spam the console about it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits. All of the con->state checks are
now under the protection of the con mutex, so this is safe. It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.
This appears to hold up well to stress testing both with and without socket
failure injection on the server side.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
If we are CLOSED, the socket is closed and we won't get these.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
If the state is CLOSED or OPENING, we shouldn't have a socket.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Avoid dropping and retaking con->mutex in the ceph_con_send() case by
leaving locking up to the caller.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.
We should also take the con mutex before printing out the state bits in
the debug output.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
This is a trivial fix for the debug output, as it is inconsistent
with the function name so may confuse people when debugging.
[elder@inktank.com: switched to use __func__]
Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
We exponentially back off when we encounter connection errors. If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.
Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node. Fixes ceph issue #2802.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
Take the con mutex while we are initiating a ceph open. This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
|
|
Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path. The problem
is that a sequence like:
- start reading message
- initialize bio_iter
- read half a message
- messenger fault, reconnect
- restart reading message
- ** bio_iter now non-NULL, not reinitialized **
- read past end of bio, crash
Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
The linger op registration (i.e., watch) modifies the object state. As
such, the OSD will reply with success if it has already applied without
doing the associated side-effects (setting up the watch session state).
If we lose the ACK and resubmit, we will see success but the watch will not
be correctly registered and we won't get notifies.
To fix this, always resubmit the linger op with a new tid. We accomplish
this by re-registering as a linger (i.e., 'registered') if we are not yet
registered. Then the second loop will treat this just like a normal
case of re-registering.
This mirrors a similar fix on the userland ceph.git, commit 5dd68b95, and
ceph bug #2796.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
Hold the mutex while twiddling all of the state bits to avoid possible
races. While we're here, make not of why we cannot close the socket
directly.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}(). This is more informative than
libceph: osd0 192.168.106.220:6801 (null)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
The server side recently added support for tuning some magic
crush variables. Decode these variables if they are present, or use the
default values if they are not present.
Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.
Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
Without this patch kernel will panic on LockD start, because lockd_up() checks
lockd_up_net() result for negative value.
From my pow it's better to return negative value from rpcbind routines instead
of replacing all such checks like in lockd_up().
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.0]
|
|
This is simply cleanup that will keep things more closely synced with the
userland code.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:
PID: 2507 TASK: ffff88103691ab40 CPU: 14 COMMAND: "rpciod/14"
#0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
#1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
#2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
#3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
#4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
#5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca
rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.
Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
|
|
rpc_make_runnable is not generally called with the queue lock held, unless
it's waking up a task that has been sitting on a waitqueue. This is safe
when the task has not entered the FSM yet, but the comments don't really
spell this out.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Add a missing set of braces that commit 4e0038b6b24
("SUNRPC: Move clnt->cl_server into struct rpc_xprt")
forgot.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.4]
|
|
Simple table that can be marked const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After IP route cache removal, rt_cache_rebuild_count is no longer
used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
added various fatal races with dst refcounts.
crashes happen on tcp workloads if routes are added/deleted at the same
time.
The dst_free() calls from free_fib_info_rcu() are clearly racy.
We need instead regular dst refcounting (dst_release()) and make
sure dst_release() is aware of RCU grace periods :
Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
before dst destruction for cached dst
Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
to make sure we dont increase a zero refcount (On a dst currently
waiting an rcu grace period before destruction)
rt_cache_route() must take a reference on the new cached route, and
release it if was not able to install it.
With this patch, my machines survive various benchmarks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
early_demux() handlers should be called in RCU context, and as we
use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
before dst use (skb_dst(skb)) or release (skb_drop(dst))
Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
->early_demux() are confusing and not needed :
Protocol handlers are already in an RCU read lock section.
(__netif_receive_skb() does the rcu_read_lock() )
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.
Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.
Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
Initialize the type field for messages in a msgpool. The caller was doing
this for osd ops, but not for the reply messages.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
ieee80211_request_sched_scan_stop() cleared
local->sched_scan_sdata. However, sched_scan_sdata
should be cleared only after the driver calls
ieee80211_sched_scan_stopped() (like with normal hw scan).
Clearing sched_scan_sdata too early caused
ieee80211_sched_scan_stopped_work to exit prematurely
without properly cleaning all the sched scan resources
and without calling cfg80211_sched_scan_stopped (so
userspace wasn't notified about sched scan completion).
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
|
|
When userspace use RTM_GETROUTE to dump route table, with an already
expired route entry, we always got an 'expires' value(2147157)
calculated base on INT_MAX.
The reason of this problem is in the following satement:
rt->dst.expires - jiffies < INT_MAX
gcc promoted the type of both sides of '<' to unsigned long, thus
a small negative value would be considered greater than INT_MAX.
With the help of Eric Dumazet, do the out of bound checks in
rtnl_put_cacheinfo(), _after_ conversion to clock_t.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The first parameter struct trie *t is not used anymore.
Remove it.
Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It should print size of struct rt_trie_node * allocated instead of size
of struct rt_trie_node.
Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
kern_path_create()/done_path_create() resp.
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now. Makes more sense, POSIX allows, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
releases what needs to be released after {kern,user}_path_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
added support for receive offloading to IOAT dma engine if available.
The code in tcp_rcv_established() tries to perform early DMA copy if
applicable. It however does so without checking whether the userspace
task is actually expecting the data in the buffer.
This is not a problem under normal circumstances, but there is a corner
case where this doesn't work -- and that's when MSG_TRUNC flag to
recvmsg() is used.
If the IOAT dma engine is not used, the code properly checks whether
there is a valid ucopy.task and the socket is owned by userspace, but
misses the check in the dmaengine case.
This problem can be observed in real trivially -- for example 'tbench' is a
good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing
IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they
have been already early-copied in tcp_rcv_established() using dma engine.
This patch introduces the same check we are performing in the simple
iovec copy case to the IOAT case as well. It fixes the indefinite
recvmsg(MSG_TRUNC) hangs.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 5b3e7e6cb5771bedda51cdb6f715d1da8cd9e644.
The problem that the original commit was attempting to fix can
never happen in practice because validation is done one a per-flow
basis rather than a per-packet basis. Adding additional checks at
runtime is unnecessary and inconsistent with the rest of the code.
CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.)
invalidated TCP early demux, because rx_dst_ifindex is not properly
initialized and checked.
Also remove the use of inet_iif(skb) in favor or skb->skb_iif
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI
flags are handled specially. Function dev_change_flags sets IFF_PROMISC and
IFF_ALLMULTI bits in dev->gflags according to the passed value but
do_setlink passes a result of rtnl_dev_combine_flags which takes those bits
from dev->flags.
This can be easily trigerred by doing:
tcpdump -i eth0 &
ip l s up eth0
ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with
IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set
IFF_PROMISC in gflags.
Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But
patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75) didn't check the negative
values. If a user assign -1 to it, the socket will set successfully and wait
for 4294967295 miliseconds. This patch add a negative value check to avoid
this issue.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds
Pull LED subsystem update from Bryan Wu.
* 'for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (50 commits)
leds-lp8788: forgotten unlock at lp8788_led_work
LEDS: propagate error codes in blinkm_detect()
LEDS: memory leak in blinkm_led_common_set()
leds: add new lp8788 led driver
LEDS: add BlinkM RGB LED driver, documentation and update MAINTAINERS
leds: max8997: Simplify max8997_led_set_mode implementation
leds/leds-s3c24xx: use devm_gpio_request
leds: convert Network Space v2 LED driver to devm_kzalloc() and cleanup error exit path
leds: convert DAC124S085 LED driver to devm_kzalloc()
leds: convert LM3530 LED driver to devm_kzalloc() and cleanup error exit path
leds: convert TCA6507 LED driver to devm_kzalloc()
leds: convert Freescale MC13783 LED driver to devm_kzalloc() and cleanup error exit path
leds: convert ADP5520 LED driver to devm_kzalloc() and cleanup error exit path
leds: convert PCA955x LED driver to devm_kzalloc() and cleanup error exit path
leds: convert Sun Fire LED driver to devm_kzalloc() and cleanup error exit path
leds: convert PCA9532 LED driver to devm_kzalloc()
leds: convert LT3593 LED driver to devm_kzalloc()
leds: convert Renesas TPU LED driver to devm_kzalloc() and cleanup error exit path
leds: convert LP5523 LED driver to devm_kzalloc() and cleanup error exit path
leds: convert PCA9633 LED driver to devm_kzalloc()
...
|
|
Pull networking updates and fixes from David Miller:
1) Reinstate the no-ref optimization for input route lookups in ipv4 to
fix some routing cache removal perf regressions.
2) Make TCP socket pre-demux work on ipv6 side too, from Eric Dumazet.
3) Get RX hash value from correct place in be2net driver, from
Sarveshwar Bandi.
4) Validation of FIB cached routes missing critical check, from Eric
Dumazet.
5) EEH support in mlx4 driver, from Kleber Sacilotto de Souza.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
ipv6: Early TCP socket demux
ipv4: Fix input route performance regression.
pch_gbe: vlan skb len fix
pch_gbe: add extra clean tx
pch_gbe: fix transmit watchdog timeout
ixgbe: fix panic while dumping packets on Tx hang with IOMMU
be2net: Fix to parse RSS hash from Receive completions correctly.
net/mlx4_en: Limit the RFS filter IDs to be < RPS_NO_FILTER
hyperv: Add error handling to rndis_filter_device_add()
hyperv: Add a check for ring_size value
ipv4: rt_cache_valid must check expired routes
net/pch_gpe: Cannot disable ethernet autonegation
qeth: repair crash in qeth_l3_vlan_rx_kill_vid()
netiucv: cleanup attribute usage
net: wiznet add missing HAS_IOMEM dependency
be2net: Missing byteswap in be_get_fw_log_level causes oops on PowerPC
mlx4: Add support for EEH error recovery
cdc-ncm: tag Ericsson WWAN devices (eg F5521gw) with FLAG_WWAN
wanmain: comparing array with NULL
caif: fix NULL pointer check
...
|
|
This is the IPv6 missing bits for infrastructure added in commit
41063e9dd1195 (ipv4: Early TCP socket demux.)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|