summaryrefslogtreecommitdiff
path: root/net/ipv4/tcp_input.c
AgeCommit message (Collapse)Author
2016-01-30tcp: avoid cwnd undo after receiving ECNYuchung Cheng
RFC 4015 section 3.4 says the TCP sender MUST refrain from reversing the congestion control state when the ACK signals congestion through the ECN-Echo flag. Currently we may not always do that when prior_ssthresh is reset upon receiving ACKs with ECE marks. This patch fixes that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29tcp: fix tcp_mark_head_lost to check skb len before fragmentingNeal Cardwell
This commit fixes a corner case in tcp_mark_head_lost() which was causing the WARN_ON(len > skb->len) in tcp_fragment() to fire. tcp_mark_head_lost() was assuming that if a packet has tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of M*mss bytes, for any M < N. But with the tricky way TCP pcounts are maintained, this is not always true. For example, suppose the sender sends 4 1-byte packets and have the last 3 packet sacked. It will merge the last 3 packets in the write queue into an skb with pcount = 3 and len = 3 bytes. If another recovery happens after a sack reneging event, tcp_mark_head_lost() may attempt to split the skb assuming it has more than 2*MSS bytes. This sounds very counterintuitive, but as the commit description for the related commit c0638c247f55 ("tcp: don't fragment SACKed skbs in tcp_mark_head_lost()") notes, this is because tcp_shifted_skb() coalesces adjacent regions of SACKed skbs, and when doing this it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire. The c0638c247f55 commit tried to avoid problems by not fragmenting SACKed skbs, since SACKed skbs are where the non-proportionality between pcount and skb->len/mss is known to be possible. However, that commit did not handle the case where during a reneging event one of these weird SACKed skbs becomes an un-SACKed skb, which tcp_mark_head_lost() can then try to fragment. The fix is to simply mark the entire skb lost when this happens. This makes the recovery slightly more aggressive in such corner cases before we detect reordering. But once we detect reordering this code path is by-passed because FACK is disabled. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2016-01-06tcp: fix zero cwnd in tcp_cwnd_reductionYuchung Cheng
Patch 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally") introduced a bug that cwnd may become 0 when both inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead to a div-by-zero if the connection starts another cwnd reduction phase by setting tp->prior_cwnd to the current cwnd (0) in tcp_init_cwnd_reduction(). To prevent this we skip PRR operation when nothing is acked or sacked. Then cwnd must be positive in all cases as long as ssthresh is positive: 1) The proportional reduction mode inflight > ssthresh > 0 2) The reduction bound mode a) inflight == ssthresh > 0 b) inflight < ssthresh sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh Therefore in all cases inflight and sndcnt can not both be 0. We check invalid tp->prior_cwnd to avoid potential div0 bugs. In reality this bug is triggered only with a sequence of less common events. For example, the connection is terminating an ECN-triggered cwnd reduction with an inflight 0, then it receives reordered/old ACKs or DSACKs from prior transmission (which acks nothing). Or the connection is in fast recovery stage that marks everything lost, but fails to retransmit due to local issues, then receives data packets from other end which acks nothing. Fixes: 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally") Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18net: Allow accepted sockets to be bound to l3mdev domainDavid Ahern
Allow accepted sockets to derive their sk_bound_dev_if setting from the l3mdev domain in which the packets originated. A sysctl setting is added to control the behavior which is similar to sk_mark and sysctl_tcp_fwmark_accept. This effectively allow a process to have a "VRF-global" listen socket, with child sockets bound to the VRF device in which the packet originated. A similar behavior can be achieved using sk_mark, but a solution using marks is incomplete as it does not handle duplicate addresses in different L3 domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev domain provides a complete solution. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30tcp: initialize tp->copied_seq in case of cross SYN connectionEric Dumazet
Dmitry provided a syzkaller (http://github.com/google/syzkaller) generated program that triggers the WARNING at net/ipv4/tcp.c:1729 in tcp_recvmsg() : WARN_ON(tp->copied_seq != tp->rcv_nxt && !(flags & (MSG_PEEK | MSG_TRUNC))); His program is specifically attempting a Cross SYN TCP exchange, that we support (for the pleasure of hackers ?), but it looks we lack proper tcp->copied_seq initialization. Thanks again Dmitry for your report and testings. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20tcp: fix potential huge kmalloc() calls in TCP_REPAIREric Dumazet
tcp_send_rcvq() is used for re-injecting data into tcp receive queue. Problems : - No check against size is performed, allowed user to fool kernel in attempting very large memory allocations, eventually triggering OOM when memory is fragmented. - In case of fault during the copy we do not return correct errno. Lets use alloc_skb_with_frags() to cook optimal skbs. Fixes: 292e8d8c8538 ("tcp: Move rcvq sending to tcp_input.c") Fixes: c0e88ff0f256 ("tcp: Repair socket queues") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: use RACK to detect lossesYuchung Cheng
This patch implements the second half of RACK that uses the the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if an not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates well on tiny degree of reordering (<3 pkts) on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicated congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track the packet timings in RACKYuchung Cheng
This patch is the first half of the RACK loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the previous FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked, then current retransmitted sequence below the newly sacked sequence must been lost, since at least one round trip time has elapsed. But it has several limitations: 1) can't detect tail drops since it depends on limited transmit 2) is disabled upon reordering (assumes no reordering) 3) only enabled in fast recovery ut not timeout recovery RACK (Recently ACK) addresses these limitations with the notion of time instead: a packet P1 is lost if a later packet P2 is s/acked, as at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when later retransmission is s/acked while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering. This patch implements tcp_advanced_rack() which tracks the most recent transmission time among the packets that have been delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp is the key to determine which packet has been lost. Consider an example that the sender sends six packets: T1: P1 (lost) T2: P2 T3: P3 T4: P4 T100: sack of P2. rack.mstamp = T2 T101: retransmit P1 T102: sack of P2,P3,P4. rack.mstamp = T4 T205: ACK of P4 since the hole is repaired. rack.mstamp = T101 We need to be careful about spurious retransmission because it may falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK to falsely mark all packets lost, just like a spurious timeout. We identify spurious retransmission by the ACK's TS echo value. If TS option is not applicable but the retransmission is acknowledged less than min-RTT ago, it is likely to be spurious. We refrain from using the transmission time of these spurious retransmissions. The second half is implemented in the next patch that marks packet lost using RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: add tcp_tsopt_ecr_before helperYuchung Cheng
a helper to prepare the main RACK patch Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: remove tcp_mark_lost_retrans()Yuchung Cheng
Remove the existing lost retransmit detection because RACK subsumes it completely. This also stops the overloading the ack_seq field of the skb control block. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track min RTT using windowed min-filterYuchung Cheng
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: apply Kern's check on RTTs used for congestion controlYuchung Cheng
Currently ca_seq_rtt_us does not use Kern's check. Fix that by checking if any packet acked is a retransmit, for both RTT used for RTT estimation and congestion control. Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-19tcp: do not set queue_mapping on SYNACKEric Dumazet
At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into SYNACK packets") we had little ways to cope with SYN floods. We no longer need to reflect incoming skb queue mappings, and instead can pick a TX queue based on cpu cooking the SYNACK, with normal XPS affinities. Note that all SYNACK retransmits were picking TX queue 0, this no longer is a win given that SYNACK rtx are now distributed on all cpus. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13net: shrink struct sock and request_sock by 8 bytesEric Dumazet
One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05tcp: avoid two atomic ops for syncookiesEric Dumazet
inet_reqsk_alloc() is used to allocate a temporary request in order to generate a SYNACK with a cookie. Then later, syncookie validation also uses a temporary request. These paths already took a reference on listener refcount, we can avoid a couple of atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05tcp: fix fastopen races vs lockless listenerEric Dumazet
There are multiple races that need fixes : 1) skb_get() + queue skb + kfree_skb() is racy An accept() can be done on another cpu, data consumed immediately. tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in socket receive queue are private. Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb 2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen() for the same reasons. 3) We want to send the SYNACK before queueing child into accept queue, otherwise we might reintroduce the ooo issue fixed in commit 7c85af881044 ("tcp: avoid reorders for TFO passive connections") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp: attach SYNACK messages to request sockets instead of listenerEric Dumazet
If a listen backlog is very big (to avoid syncookies), then the listener sk->sk_wmem_alloc is the main source of false sharing, as we need to touch it twice per SYNACK re-transmit and TX completion. (One SYN packet takes listener lock once, but up to 6 SYNACK are generated) By attaching the skb to the request socket, we remove this source of contention. Tested: listen(fd, 10485760); // single listener (no SO_REUSEPORT) 16 RX/TX queue NIC Sustain a SYNFLOOD attack of ~320,000 SYN per second, Sending ~1,400,000 SYNACK per second. Perf profiles now show listener spinlock being next bottleneck. 20.29% [kernel] [k] queued_spin_lock_slowpath 10.06% [kernel] [k] __inet_lookup_established 5.12% [kernel] [k] reqsk_timer_handler 3.22% [kernel] [k] get_next_timer_interrupt 3.00% [kernel] [k] tcp_make_synack 2.77% [kernel] [k] ipt_do_table 2.70% [kernel] [k] run_timer_softirq 2.50% [kernel] [k] ip_finish_output 2.04% [kernel] [k] cascade Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp/dccp: install syn_recv requests into ehash tableEric Dumazet
In this patch, we insert request sockets into TCP/DCCP regular ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of using the per listener hash table. ACK packets find SYN_RECV pseudo sockets without having to find and lock the listener. In nominal conditions, this halves pressure on listener lock. Note that this will allow for SO_REUSEPORT refinements, so that we can select a listener using cpu/numa affinities instead of the prior 'consistent hash', since only SYN packets will apply this selection logic. We will shrink listen_sock in the following patch to ease code review. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03tcp: move synflood_warned into struct request_sock_queueEric Dumazet
long term plan is to remove struct listen_sock when its hash table is no longer there. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: constify tcp_syn_flood_action() socket argumentEric Dumazet
tcp_syn_flood_action() will soon be called with unlocked socket. In order to avoid SYN flood warning being emitted multiple times, use xchg(). Extend max_qlen_log and synflood_warned fields in struct listen_sock to u32 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: remove tcp_rcv_state_process() tcp_hdr argumentEric Dumazet
Factorize code to get tcp header from skb. It makes no sense to duplicate code in callers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: remove unused len argument from tcp_rcv_state_process()Eric Dumazet
Once we realize tcp_rcv_synsent_state_process() does not use its 'len' argument and we get rid of it, then it becomes clear this argument is no longer used in tcp_rcv_state_process() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29tcp: avoid reorders for TFO passive connectionsEric Dumazet
We found that a TCP Fast Open passive connection was vulnerable to reorders, as the exchange might look like [1] C -> S S <FO ...> <request> [2] S -> C S. ack request <options> [3] S -> C . <answer> packets [2] and [3] can be generated at almost the same time. If C receives the 3rd packet before the 2nd, it will drop it as the socket is in SYN_SENT state and expects a SYNACK. S will have to retransmit the answer. Current OOO avoidance in linux is defeated because SYNACK packets are attached to the LISTEN socket, while DATA packets are attached to the children. They might be sent by different cpus, and different TX queues might be selected. It turns out that for TFO, we created a child, which is a full blown socket in TCP_SYN_RECV state, and we simply can attach the SYNACK packet to this socket. This means that at the time tcp_sendmsg() pushes DATA packet, skb->ooo_okay will be set iff the SYNACK packet had been sent and TX completed. This removes the reorder source at the host level. We also removed the export of tcp_try_fastopen(), as it is no longer called from IPv6. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21tcp: usec resolution SYN/ACK RTTYuchung Cheng
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK RTT is often measured as 0ms or sometimes 1ms, which would affect RTT estimation and min RTT samping used by some congestion control. This patch improves SYN/ACK RTT to be usec resolution if platform supports it. While the timestamping of SYN/ACK is done in request sock, the RTT measurement is carefully arranged to avoid storing another u64 timestamp in tcp_sock. For regular handshake w/o SYNACK retransmission, the RTT is sampled right after the child socket is created and right before the request sock is released (tcp_check_req() in tcp_minisocks.c) For Fast Open the child socket is already created when SYN/ACK was sent, the RTT is sampled in tcp_rcv_state_process() after processing the final ACK an right before the request socket is released. If the SYN/ACK was retransmistted or SYN-cookie was used, we rely on TCP timestamps to measure the RTT. The sample is taken at the same place in tcp_rcv_state_process() after the timestamp values are validated in tcp_validate_incoming(). Note that we do not store TS echo value in request_sock for SYN-cookies, because the value is already stored in tp->rx_opt used by tcp_ack_update_rtt(). One side benefit is that the RTT measurement now happens before initializing congestion control (of the passive side). Therefore the congestion control can use the SYN/ACK RTT. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-18tcp: provide skb->hash to synack packetsEric Dumazet
In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf on xmit"), Tom provided a l4 hash to most outgoing TCP packets. We'd like to provide one as well for SYNACK packets, so that all packets of a given flow share same txhash, to later enable bonding driver to also use skb->hash to perform slave selection. Note that a SYNACK retransmit shuffles the tx hash, as Tom did in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing advice") for established sockets. This has nice effect making TCP flows resilient to some kind of black holes, even at connection establish phase. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31tcp: use dctcp if enabled on the route to the initiatorDaniel Borkmann
Currently, the following case doesn't use DCTCP, even if it should: A responder has f.e. Cubic as system wide default, but for a specific route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The initiating host then uses DCTCP as congestion control, but since the initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok, and we have to fall back to Reno after 3WHS completes. We were thinking on how to solve this in a minimal, non-intrusive way without bloating tcp_ecn_create_request() needlessly: lets cache the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0) is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows to only do a single metric feature lookup inside tcp_ecn_create_request(). Joint work with Florian Westphal. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25tcp: refine pacing rate determinationEric Dumazet
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200 % against current rate, to allow probing for optimal throughput even during slow start phase, where cwnd can be doubled every other gRTT. At Google, we found it was better applying a different ratio while in Congestion Avoidance phase. This ratio was set to 120 %. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2 : - After cwnd reduction, it is safer to ramp up more slowly, as we approach optimal cwnd. - Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25tcp: fix slow start after idle vs TSO/GSOEric Dumazet
slow start after idle might reduce cwnd, but we perform this after first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check in skb_entail() makes sense, because we slightly reduce number of times this check is done, especially for large send() and TCP Small queue callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when receive window opens. Tested: Following packetdrill test demonstrates the problem // Test of slow start after idle `sysctl -q net.ipv4.tcp_slow_start_after_idle=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.100 < . 1:1(0) ack 1 win 511 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0 +0 write(4, ..., 26000) = 26000 +0 > . 1:5001(5000) ack 1 +0 > . 5001:10001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% +.100 < . 1:1(0) ack 10001 win 511 +0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }% +0 > . 10001:20001(10000) ack 1 +0 > P. 20001:26001(6000) ack 1 +.100 < . 1:1(0) ack 26001 win 511 +0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }% +4 write(4, ..., 20000) = 20000 // If slow start after idle works properly, we should send 5 MSS here (cwnd/2) +0 > . 26001:31001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% +0 > . 31001:36001(5000) ack 1 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/bridge/br_mdb.c br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22net: track success and failure of TCP PMTU probingRick Jones
Track success and failure of TCP PMTU probing. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16tcp: don't use F-RTO on non-recurring timeoutsYuchung Cheng
Currently F-RTO may repeatedly send new data packets on non-recurring timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682) should only be used on either new recovery or recurring timeouts. This exacerbates the recovery progress during frequent timeout & repair, because we prioritize sending new data packets instead of repairing the holes when the bandwidth is already scarce. Fix it by correcting the test of a new recovery episode. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09tcp: update congestion state first before raising cwndYuchung Cheng
The congestion state and cwnd can be updated in the wrong order. For example, upon receiving a dubious ACK, we incorrectly raise the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because the state is still Open, then enter recovery state to reduce cwnd. For another example, if the ACK indicates spurious timeout or retransmits, we first revert the cwnd reduction and congestion state back to Open state. But we don't raise the cwnd even though the ACK does not indicate any congestion. To fix this problem we should first call tcp_fastretrans_alert() to process the dubious ACK and update the congestion state, then call tcp_may_raise_cwnd() that raises cwnd based on the current state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09tcp: v1 always send a quick ack when quickacks are enabledJon Maxwell
V1 of this patch contains Eric Dumazet's suggestion to move the per dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric. I ran some tests and after setting the "ip route change quickack 1" knob there were still many delayed ACKs sent. This occured because when icsk_ack.quick=0 the !icsk_ack.pingpong value is subsequently ignored as tcp_in_quickack_mode() checks both these values. The condition for a quick ack to trigger requires that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently only icsk_ack.pingpong is controlled by the knob. But the icsk_ack.quick value changes dynamically depending on heuristics. The crux of the matter is that delayed acks still cannot be entirely disabled even with the RTAX_QUICKACK per dst knob enabled. This patch ensures that a quick ack is always sent when the RTAX_QUICKACK per dst knob is turned on. The "ip route change quickack 1" knob was recently added to enable quickacks. It was modeled around the TCP_QUICKACK setsockopt() option. This issue is that even with "ip route change quickack 1" enabled we still see delayed ACKs under some conditions. It would be nice to be able to completely disable delayed ACKs. Here is an example: # netstat -s|grep dela 3 delayed acks sent For all routes enable the knob # ip route change quickack 1 Generate some traffic across a slow link and we still see the delayed acks. # netstat -s|grep dela 106 delayed acks sent 1 delayed acks further delayed because of locked socket The issue is that both the "ip route change quickack 1" knob and the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0. However at the business end in the __tcp_ack_snd_check() routine, tcp_in_quickack_mode() checks that both icsk_ack.quick != 0 and icsk_ack.pingpong=0 in order to trigger a quickack. As icsk_ack.quick is determined by heuristics it can be 0. When that occurs the icsk_ack.pingpong value is ignored and a delayed ACK is sent regardless. This patch moves the RTAX_QUICKACK per dst check into the tcp_in_quickack_mode() routine which ensures that a quickack is always sent when the quickack knob is enabled for that dst. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08tcp: PRR uses CRB mode by default and SS mode conditionallyYuchung Cheng
PRR slow start is often too aggressive especially when drops are caused by traffic policers. The policers mainly use token bucket to enforce the rate so sending (twice) faster than the delivery rate causes excessive drops. This patch changes PRR to the conservative reduction bound (CRB) mode in RFC 6937 by default. CRB follows the packet conservation rule to send at most the delivery rate by default. But if many packets are lost and the pipe is empty, CRB may take N round trips to repair N losses. We conditionally turn on slow start mode if all these conditions are made to speed up the recovery: 1) on the second round or later in recovery 2) retransmission sent in the previous round is delivered on this ACK 3) no retransmission is marked lost on this ACK By using packet conservation by default, this change reduces the loss retransmits signicantly on networks that deploy traffic policers, up to 20% reduction of overall loss rate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08tcp: reduce cwnd if retransmit is lost in CA_LossYuchung Cheng
If the retransmission in CA_Loss is lost again, we should not continue to slow start or raise cwnd in congestion avoidance mode. Instead we should enter fast recovery and use PRR to reduce cwnd, following the principle in RFC5681: "... or the loss of a retransmission, should be taken as two indications of congestion and, therefore, cwnd (and ssthresh) MUST be lowered twice in this case." This is especially important to reduce loss when the CA_Loss state was caused by a traffic policer dropping the entire inflight. The CA_Loss state has a problem where a loss of L packets causes the sender to send a burst of L packets. So a policer that's dropping most packets in a given RTT can cause a huge retransmit storm. By contrast, PRR includes logic to bound the number of outbound packets that result from a given ACK. So switching to CA_Recovery on lost retransmits in CA_Loss avoids this retransmit storm problem when in CA_Loss. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: fill shinfo->gso_size at last momentEric Dumazet
In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored gso_segs in a temporary cache hot location. This patch does the same for gso_size. This allows to save 2 cache line misses in tcp xmit path for the last packet that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: fill shinfo->gso_type at last momentEric Dumazet
Our goal is to touch skb_shinfo(skb) only when absolutely needed, to avoid two cache line misses in TCP output path for last skb that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) A packet is GSO only when skb_shinfo(skb)->gso_size is not zero. We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for non GSO packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: export tcp_enter_cwr()Kenneth Klette Jonassen
Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this function so that CDG can be compiled as a module. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22tcp: fix a potential deadlock in tcp_get_info()Eric Dumazet
Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes. [ 523.722504] ====================================================== [ 523.728706] [ INFO: possible circular locking dependency detected ] [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted [ 523.739202] ------------------------------------------------------- [ 523.745474] ss/18032 is trying to acquire lock: [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360 [ 523.758129] [ 523.758129] but task is already holding lock: [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0 [ 523.774661] [ 523.774661] which lock already depends on the new lock. [ 523.774661] [ 523.782850] [ 523.782850] the existing dependency chain (in reverse order) is: [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}: [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270 [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50 [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110 [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350 [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500 [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0 [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10 [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0 [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0 [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0 [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420 Lets use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19tcp: don't over-send F-RTO probesYuchung Cheng
After sending the new data packets to probe (step 2), F-RTO may incorrectly send more probes if the next ACK advances SND_UNA and does not sack new packet. However F-RTO RFC 5682 probes at most once. This bug may cause sender to always send new data instead of repairing holes, inducing longer HoL blocking on the receiver for the application. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19tcp: only undo on partial ACKs in CA_LossYuchung Cheng
Undo based on TCP timestamps should only happen on ACKs that advance SND_UNA, according to the Eifel algorithm in RFC 3522: Section 3.2: (4) If the value of the Timestamp Echo Reply field of the acceptable ACK's Timestamps option is smaller than the value of RetransmitTS, then proceed to step (5), Section Terminology: We use the term 'acceptable ACK' as defined in [RFC793]. That is an ACK that acknowledges previously unacknowledged data. This is because upon receiving an out-of-order packet, the receiver returns the last timestamp that advances RCV_NXT, not the current timestamp of the packet in the DUPACK. Without checking the flag, the DUPACK will cause tcp_packet_delayed() to return true and tcp_try_undo_loss() will revert cwnd reduction. Note that we check the condition in CA_Recovery already by only calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or tcp_try_undo_recovery() if snd_una crosses high_seq. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18tcp: allow one skb to be received per socket under memory pressureEric Dumazet
While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventuallu force memory hogs to release their allocation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18tcp: introduce tcp_under_memory_pressure()Eric Dumazet
Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: adjust window probe timers to safer valuesEric Dumazet
With the advent of small rto timers in datacenter TCP, (ip route ... rto_min x), the following can happen : 1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL) 2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes ! If receiver stops the sender, it obviously doesn't care of very tight rto. Probability of dropping the ACK reopening the window is not worth the risk. Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits) A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05tcp: provide SYN headers for passive connectionsEric Dumazet
This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such patch for about 3 years at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04tcp: invoke pkts_acked hook on every ACKKenneth Klette Jonassen
Invoking pkts_acked is currently conditioned on FLAG_ACKED: receiving a cumulative ACK of new data, or ACK with SYN flag set. Remove this condition so that CC may get RTT measurements from all SACKs. Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04tcp: improve RTT from SACK for CCKenneth Klette Jonassen
tcp_sacktag_one() always picks the earliest sequence SACKed for RTT. This might not make sense for congestion control in cases where: 1. ACKs are lost, i.e. a SACK following a lost SACK covers both new and old segments at the receiver. 2. The receiver disregards the RFC 5681 recommendation to immediately ACK out-of-order segments. Give congestion control a RTT for the latest segment SACKed, which is the most accurate RTT estimate, but preserve the conservative RTT for RTO. Removes the call to skb_mstamp_get() in tcp_sacktag_one(). Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04tcp: move struct tcp_sacktag_state to tcp_ack()Kenneth Klette Jonassen
Later patch passes two values set in tcp_sacktag_one() to tcp_clean_rtx_queue(). Prepare passing them via struct tcp_sacktag_state. Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>