summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2013-06-07net: tcp: move GRO/GSO functions to tcp_offloadDaniel Borkmann
Would be good to make things explicit and move those functions to a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c. While moving all related functions into tcp_offload.c, we can also make some of them static, since they are only used there. Also, add an explicit registration function. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07net: minor: tcp: use tcp_skb_mss helper in tcp_tso_segmentDaniel Borkmann
We have the minimal inline helper tcp_skb_mss to access skb_shinfo(skb)->gso_size, so also use it here to get mss. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Conflicts: net/netfilter/nf_log.c The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS protection around foo_proc_entry() calls to fix a build failure, whereas in Pablo's tree a guard if() test around a call is remove_proc_entry() was removed. Trivially resolved. Pablo Neira Ayuso says: ==================== The following patchset contains the first batch of Netfilter/IPVS updates for your net-next tree, they are: * Three patches with improvements and code refactorization for nfnetlink_queue, from Florian Westphal. * FTP helper now parses replies without brackets, as RFC1123 recommends, from Jeff Mahoney. * Rise a warning to tell everyone about ULOG deprecation, NFLOG has been already in the kernel tree for long time and supersedes the old logging over netlink stub, from myself. * Don't panic if we fail to load netfilter core framework, just bail out instead, from myself. * Add cond_resched_rcu, used by IPVS to allow rescheduling while walking over big hashtables, from Simon Horman. * Change type of IPVS sysctl_sync_qlen_max sysctl to avoid possible overflow, from Zhang Yanfei. * Use strlcpy instead of strncpy to skip zeroing of already initialized area to write the extension names in ebtables, from Chen Gang. * Use already existing per-cpu notrack object from xt_CT, from Eric Dumazet. * Save explicit socket lookup in xt_socket now that we have early demux, also from Eric Dumazet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge 'net' bug fixes into 'net-next' as we have patches that will build on top of them. This merge commit includes a change from Emil Goode (emilgoode@gmail.com) that fixes a warning that would have been introduced by this merge. Specifically it fixes the pingv6_ops method ipv6_chk_addr() to add a "const" to the "struct net_device *dev" argument and likewise update the dummy_ipv6_chk_addr() declaration. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04ping: always initialize ->sin6_scope_id and ->sin6_flowinfoCong Wang
If we don't need scope id, we should initialize it to zero. Same for ->sin6_flowinfo. Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04Kconfig: remove dangling references to the deleted fileJean Sacren
Commit 202dc3fc599c1dded235d3b448d9ca924252e354 (Documentation: remove obsolete networking/multicast.txt file) deleted the obsolete file. After the file has been removed, clean up a couple of places where references to the deleted file were made so that users wouldn't be confused when they consult the Help menu. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: ipv6: Implement /proc/net/icmp6.Lorenzo Colitti
The format is based on /proc/net/icmp and /proc/net/{udp,raw}6. Compiles and displays reasonable results with CONFIG_IPV6={n,m,y} Couldn't figure out how to test without CONFIG_PROC_FS enabled. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: ipv4: make the ping /proc code AF-independentLorenzo Colitti
Introduce a ping_seq_afinfo structure (similar to its UDP equivalent) and use it to make some of the ping /proc functions address-family independent. Rename the remaining ping /proc functions from ping_* to ping_v4_*. Compiles and displays reasonable results with CONFIG_IPV6={n,m,y} Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03icmp: avoid allocating large struct on stackCong Wang
struct icmp_bxm is a large struct, reduce stack usage by allocating it on heap. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03] icmp: fix icmp_unreach() comment.Rami Rosen
ICMP_PARAMETERPROB is handled by icmp_unreach(); This patch adds ICMP_PARAMETERPROB to the list of ICMP message types handled by icmp_unreach(). Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: use separate genid for next hop exceptionsTimo Teräs
commit 13d82bf5 (ipv4: Fix flushing of cached routing informations) added the support to flush learned pmtu information. However, using rt_genid is quite heavy as it is bumped on route add/change and multicast events amongst other places. These can happen quite often, especially if using dynamic routing protocols. While this is ok with routes (as they are just recreated locally), the pmtu information is learned from remote systems and the icmp notification can come with long delays. It is worthy to have separate genid to avoid excessive pmtu resets. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: rate limit updating of next hop exceptions with same pmtuTimo Teräs
The tunnel devices call update_pmtu for each packet sent, this causes contention on the fnhe_lock. Ignore the pmtu update if pmtu is not actually changed, and there is still plenty of time before the entry expires. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: properly refresh rtable entries on pmtu/redirect eventsTimo Teräs
This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on IPsec pmtu events). Flushing all cached entries is not needed. Instead, invalidate only the related next hop dsts to recheck for the added next hop exception where needed. This also fixes a subtle race due to bumping generation id's before updating the pmtu. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01sit: add IPv4 over IPv4 supportNicolas Dichtel
This patch adds the support of IPv4 over Ipv4 for the module sit. The gain of this feature is to be able to have 4in4 and 6in4 over the same interface instead of having one interface for 6in4 and another for 4in4 even if encapsulation addresses are the same. To avoid conflicting with ipip module, sit IPv4 over IPv4 protocol is registered with a smaller priority. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01iptunnel: specify protocol outside IP headerNicolas Dichtel
Before this patch, ip_tunnel_xmit() was using the field protocol from the IP header passed into argument. There is no functional change, this patch prepares the support of IPv4 over IPv4 for module sit. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31gro: should aggregate frames without DFEric Dumazet
GRO on IPv4 doesn't aggregate frames if they don't have DF bit set. Some servers use IP_MTU_DISCOVER/IP_PMTUDISC_PROBE, so linux receivers are unable to aggregate this kind of traffic. The right thing to do is to allow aggregation as long as the DF bit has same value on all segments. bnx2x LRO does this correctly. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ben Hutchings <bhutchings@solarflare.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31net: Update RFS target at poll for tcp/udpDavid Majnemer
The current state of affairs is that read()/write() will setup RFS (Receive Flow Steering) for internet protocol sockets while poll()/epoll() does not. When poll() gets called with a TCP or UDP socket, we should update the flow target. This permits to RFS (if enabled) to select the appropriate CPU for following incoming packets. Note: Only connected UDP sockets can benefit from RFS. Signed-off-by: David Majnemer <majnemer@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paul Turner <pjt@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31tcp: undo on DSACK during recoveryYuchung Cheng
If the receiver supports DSACK, sender can detect false recoveries and revert cwnd reductions triggered by either severe network reordering or concurrent reordering and loss event. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31tcp: fix undo on partial ack in recoveryYuchung Cheng
Upon detecting spurious fast retransmit via timestamps during recovery, use PRR to clock out new data packet instead of retransmission. Once all retransmission are proven spurious, the sender then reverts the cwnd reduction and congestion state to open or disorder. The current code does the opposite: it undoes cwnd as soon as any retransmission is spurious and continues to retransmit until all data are acked. This nullifies the point to undo the cwnd because the sender is still retransmistting spuriously. This patch fixes it. The undo_ssthresh argument of tcp_undo_cwnd_reductiuon() is no longer needed and is removed. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31tcp: refactor undo functionsYuchung Cheng
Refactor and relocate various functions or variables to prepare the undo fix. Remove some unused function arguments. Rename tcp_undo_cwr to tcp_undo_cwnd_reduction to be consistent with the rest of CWR related function names. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31tcp: consolidate PRR packet accountingYuchung Cheng
This patch series fixes an undo bug in fast recovery: the sender mistakenly undos the cwnd too early but continues fast retransmits until all pending data are acked. This also multiplies the SNMP stat PARTIALUNDO events by the degree of the network reordering. The first patch prepares the fix by consolidating the accounting of newly_acked_sacked in tcp_cwnd_reduction(), instead of updating newly_acked_sacked everytime sacked_out is adjusted. Also pass acked and prior_unsacked as const type because they are readonly in the rest of recovery processing. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter/IPVS fixes for 3.10-rc3, they are: * fix xt_addrtype with IPv6, from Florian Westphal. This required a new hook for IPv6 functions in the netfilter core to avoid hard dependencies with the ipv6 subsystem when this match is only used for IPv4. * fix connection reuse case in IPVS. Currently, if an reused connection are directed to the same server. If that server is down, those connection would fail. Therefore, clear the connection and choose a new server among the available ones. * fix possible non-nul terminated string sent to user-space if ipt_ULOG is used as the default netfilter logging stub, from Chen Gang. * fix mark logging of IPv6 packets in xt_LOG, from Michal Kubecek. This bug has been there since 2.6.26. * Fix breakage ip_vs_sh due to incorrect structure layout for RCU, from Jan Beulich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-29net, ipv4, ipv6: Correct assignment of skb->network_header to skb->tailSimon Horman
This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer however skb->network_header is now an offset. This patch corrects the problem by adding a wrapper to return skb tail as an offset regardless of the value of NET_SKBUFF_DATA_USES_OFFSET. It seems that skb->tail that this offset may be more than 64k and some care has been taken to treat such cases as an error. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-29ipv4: Correct comparisons and calculations using skb->tail and ↵Simon Horman
skb-transport_header This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer whereas skb->transport_header will be an offset from head. This is corrected by using wrappers that ensure that comparisons and calculations are always made using pointers. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-29net: always pass struct netdev_notifier_info to netdevice notifiersCong Wang
commit 351638e7deeed2ec8ce451b53d3 (net: pass info struct via netdevice notifier) breaks booting of my KVM guest, this is due to we still forget to pass struct netdev_notifier_info in several places. This patch completes it. Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28arp: flush arp cache on IFF_NOARP changeTimo Teräs
IFF_NOARP affects what kind of neighbor entries are created (nud NOARP or nud INCOMPLETE). If the flag changes, flush the arp cache to refresh all entries. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: shortened notifier_info struct name Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: pass info struct via netdevice notifierJiri Pirko
So far, only net_device * could be passed along with netdevice notifier event. This patch provides a possibility to pass custom structure able to provide info that event listener needs to know. Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: fix typo on simeth shortened dev_getter shortened notifier_info struct name v1->v2: fix notifier_call parameter in call_netdevice_notifier() Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28ipv4: fix redirect handling for TCP packetsMichal Kubecek
Unlike ipv4_redirect() and ipv4_sk_redirect(), ip_do_redirect() doesn't call __build_flow_key() directly but via ip_rt_build_flow_key() wrapper. This leads to __build_flow_key() getting pointer to IPv4 header of the ICMP redirect packet rather than pointer to the embedded IPv4 header of the packet initiating the redirect. As a result, handling of ICMP redirects initiated by TCP packets is broken. Issue was introduced by 4895c771c ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28MPLS: Add limited GSO supportSimon Horman
In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets. The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO. SKB Usage: When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata: * Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch. * Set skb->protocol to the new MPLS ethertype of the packet. * Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack. I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwtich. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwtich tree but it is intended that it be added to the Open vSwtich code present in the mainline Linux kernel at some point. Features: I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation. This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets. A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software. Alternate Implementation: One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_push() and __skb_push() in mpls_gso_segment(). I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module. MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar. Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-26ip_tunnel: fix kernel panic with icmp_dest_unreachEric Dumazet
Daniel Petre reported crashes in icmp_dst_unreach() with following call graph: #3 [ffff88003fc03938] __stack_chk_fail at ffffffff81037f77 #4 [ffff88003fc03948] icmp_send at ffffffff814d5fec #5 [ffff88003fc03ae8] ipv4_link_failure at ffffffff814a1795 #6 [ffff88003fc03af8] ipgre_tunnel_xmit at ffffffff814e7965 #7 [ffff88003fc03b78] dev_hard_start_xmit at ffffffff8146e032 #8 [ffff88003fc03bc8] sch_direct_xmit at ffffffff81487d66 #9 [ffff88003fc03c08] __qdisc_run at ffffffff81487efd #10 [ffff88003fc03c48] dev_queue_xmit at ffffffff8146e5a7 #11 [ffff88003fc03c88] ip_finish_output at ffffffff814ab596 Daniel found a similar problem mentioned in http://lkml.indiana.edu/hypermail/linux/kernel/1007.0/00961.html And indeed this is the root cause : skb->cb[] contains data fooling IP stack. We must clear IPCB in ip_tunnel_xmit() sooner in case dst_link_failure() is called. Or else skb->cb[] might contain garbage from GSO segmentation layer. A similar fix was tested on linux-3.9, but gre code was refactored in linux-3.10. I'll send patches for stable kernels as well. Many thanks to Daniel for providing reports, patches and testing ! Reported-by: Daniel Petre <daniel.petre@rcs-rds.ro> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-26tcp: Remove 2 indentation levels in tcp_rcv_state_processJoe Perches
case TCP_FIN_WAIT1 can also be simplified by reversing tests and adding breaks; Add braces after case and move automatic definitions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-26tcp: Remove another indentation level in tcp_rcv_state_processJoe Perches
case TCP_SYN_RECV: can have another indentation level removed by converting if (acceptable) { ...; } else { return 1; } to if (!acceptable) return 1; ...; Reflow code and comments to fit 80 columns. Another pure cleanup patch. Signed-off-by: Joe Perches <joe@perches.com> Improved-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-26tcp: remove one indentation level in tcp_rcv_state_process()Eric Dumazet
Remove one level of indentation 'introduced' in commit c3ae62af8e75 (tcp: should drop incoming frames without ACK flag set) if (true) { ... } @acceptable variable is a boolean. This patch is a pure cleanup. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-26net: ipv6: Add IPv6 support to the ping socket.Lorenzo Colitti
This adds the ability to send ICMPv6 echo requests without a raw socket. The equivalent ability for ICMPv4 was added in 2011. Instead of having separate code paths for IPv4 and IPv6, make most of the code in net/ipv4/ping.c dual-stack and only add a few IPv6-specific bits (like the protocol definition) to a new net/ipv6/ping.c. Hopefully this will reduce divergence and/or duplication of bugs in the future. Caveats: - Setting options via ancillary data (e.g., using IPV6_PKTINFO to specify the outgoing interface) is not yet supported. - There are no separate security settings for IPv4 and IPv6; everything is controlled by /proc/net/ipv4/ping_group_range. - The proc interface does not yet display IPv6 ping sockets properly. Tested with a patched copy of ping6 and using raw socket calls. Compiles and works with all of CONFIG_IPV6={n,m,y}. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge net into net-next because some upcoming net-next changes build on top of bug fixes that went into net. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-24tcp: xps: fix reordering issuesEric Dumazet
commit 3853b5841c01a ("xps: Improvements in TX queue selection") introduced ooo_okay flag, but the condition to set it is slightly wrong. In our traces, we have seen ACK packets being received out of order, and RST packets sent in response. We should test if we have any packets still in host queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-23netfilter: ipt_ULOG: fix non-null terminated string in the nf_log pathChen Gang
If nf_log uses ipt_ULOG as logging output, we can deliver non-null terminated strings to user-space since the maximum length of the prefix that is passed by nf_log is NF_LOG_PREFIXLEN but pm->prefix is 32 bytes long (ULOG_PREFIX_LEN). This is actually happening already from nf_conntrack_tcp if ipt_ULOG is used, since it is passing strings longer than 32 bytes. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-05-23netfilter: {ipt,ebt}_ULOG: rise warning on deprecationPablo Neira Ayuso
This target has been superseded by NFLOG. Spot a warning so we prepare removal in a couple of years. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
2013-05-23tcp: bug fix in proportional rate reduction.Nandita Dukkipati
This patch is a fix for a bug triggering newly_acked_sacked < 0 in tcp_ack(.). The bug is triggered by sacked_out decreasing relative to prior_sacked, but packets_out remaining the same as pior_packets. This is because the snapshot of prior_packets is taken after tcp_sacktag_write_queue() while prior_sacked is captured before tcp_sacktag_write_queue(). The problem is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment) adjusts the pcount for packets_out and sacked_out (MSS change or other reason). As a result, this delta in pcount is reflected in (prior_sacked - sacked_out) but not in (prior_packets - packets_out). This patch does the following: 1) initializes prior_packets at the start of tcp_ack() so as to capture the delta in packets_out created by tcp_fragment. 2) introduces a new "previous_packets_out" variable that snapshots packets_out right before tcp_clean_rtx_queue, so pkts_acked can be correctly computed as before. 3) Computes pkts_acked using previous_packets_out, and computes newly_acked_sacked using prior_packets. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20tcp: md5: remove spinlock usage in fast pathEric Dumazet
TCP md5 code uses per cpu variables but protects access to them with a shared spinlock, which is a contention point. [ tcp_md5sig_pool_lock is locked twice per incoming packet ] Makes things much simpler, by allocating crypto structures once, first time a socket needs md5 keys, and not deallocating them as they are really small. Next step would be to allow crypto allocations being done in a NUMA aware way. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20ip_gre: fix a possible crash in ipgre_err()Eric Dumazet
Another fix needed in ipgre_err(), as parse_gre_header() might change skb->head. Bug added in commit c54419321455 (GRE: Refactor GRE tunneling code.) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20tcp: remove bad timeout logic in fast recoveryYuchung Cheng
tcp_timeout_skb() was intended to trigger fast recovery on timeout, unfortunately in reality it often causes spurious retransmission storms during fast recovery. The particular sign is a fast retransmit over the highest sacked sequence (SND.FACK). Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion to avoid spurious timeout: when SND.UNA advances the sender re-arms RTO and extends the timeout by icsk_rto. The sender does not offset the time elapsed since the packet at SND.UNA was sent. But if the next (DUP)ACK arrives later than ~RTTVAR and triggers tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet sent before the icsk_rto interval lost, including one that's above the highest sacked sequence. Most likely a large part of scorebard will be marked. If most packets are not lost then the subsequent DUPACKs with new SACK blocks will cause the sender to continue to retransmit packets beyond SND.FACK spuriously. Even if only one packet is lost the sender may falsely retransmit almost the entire window. The situation becomes common in the world of bufferbloat: the RTT continues to grow as the queue builds up but RTTVAR remains small and close to the minimum 200ms. If a data packet is lost and the DUPACK triggered by the next data packet is slightly delayed, then a spurious retransmission storm forms. As the original comment on tcp_timeout_skb() suggests: the usefulness of this feature is questionable. It also wastes cycles walking the sack scoreboard and is actually harmful because of false recovery. It's time to remove this. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-16tcp: speedup tcp_fixup_rcvbuf()Eric Dumazet
tcp_fixup_rcvbuf() contains a loop to estimate initial socket rcv space needed for a given mss. With large MTU (like 64K on lo), we can loop ~500 times and consume a lot of cpu cycles. perf top of 200 concurrent netperf -t TCP_CRR 5.62% netperf [kernel.kallsyms] [k] tcp_init_buffer_space 1.71% netperf [kernel.kallsyms] [k] _raw_spin_lock 1.55% netperf [kernel.kallsyms] [k] kmem_cache_free 1.51% netperf [kernel.kallsyms] [k] tcp_transmit_skb 1.50% netperf [kernel.kallsyms] [k] tcp_ack Lets use a 100% factor, and remove the loop. 100% is needed anyway for tcp_adv_win_scale=1 default value, and is also the maximum factor. Refs: commit b49960a05e32 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-16tcp: gso: do not generate out of order packetsEric Dumazet
GSO TCP handler has following issues : 1) ooo_okay from original GSO packet is duplicated to all segments 2) segments (but the last one) are orphaned, so transmit path can not get transmit queue number from the socket. This happens if GSO segmentation is done before stacked device for example. Result is we can send packets from a given TCP flow to different TX queues (if using multiqueue NICS). This generates OOO problems and spurious SACK & retransmits. Fix this by keeping socket pointer set for all segments. This means that every segment must also have a destructor, and the original gso skb truesize must be split on all segments, to keep precise sk->sk_wmem_alloc accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-16Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains three Netfilter fixes and update for the MAINTAINER file for your net tree, they are: * Fix crash if nf_log_packet is called from conntrack, in that case both interfaces are NULL, from Hans Schillstrom. This bug introduced with the logging netns support in the previous merge window. * Fix compilation of nf_log and nf_queue without CONFIG_PROC_FS, from myself. This bug was introduced in the previous merge window with the new netns support for the netfilter logging infrastructure. * Fix possible crash in xt_TCPOPTSTRIP due to missing sanity checkings to validate that the TCP header is well-formed, from myself. I can find this bug in 2.6.25, probably it's been there since the beginning. I'll pass this to -stable. * Update MAINTAINER file to point to new nf trees at git.kernel.org, remove Harald and use M: instead of P: (now obsolete tag) to keep Jozsef in the list of people. Please, consider pulling this. Thanks! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-15netfilter: log: netns NULL ptr bug when calling from conntrackHans Schillstrom
Since (69b34fb netfilter: xt_LOG: add net namespace support for xt_LOG), we hit this: [ 4224.708977] BUG: unable to handle kernel NULL pointer dereference at 0000000000000388 [ 4224.709074] IP: [<ffffffff8147f699>] ipt_log_packet+0x29/0x270 when callling log functions from conntrack both in and out are NULL i.e. the net pointer is invalid. Adding struct net *net in call to nf_logfn() will secure that there always is a vaild net ptr. Reported as netfilter's bugzilla bug 818: https://bugzilla.netfilter.org/show_bug.cgi?id=818 Reported-by: Ronald <ronald645@gmail.com> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-05-14tcp: fix tcp_md5_hash_skb_data()Eric Dumazet
TCP md5 communications fail [1] for some devices, because sg/crypto code assume page offsets are below PAGE_SIZE. This was discovered using mlx4 driver [2], but I suspect loopback might trigger the same bug now we use order-3 pages in tcp_sendmsg() [1] Failure is giving following messages. huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100, exited with 00000101? [2] mlx4 driver uses order-2 pages to allocate RX frags Reported-by: Matt Schnall <mischnal@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Bernhard Beck <bbeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-11ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functionsDenis Efremov
EXPORT_SYMBOL and inline directives are contradictory to each other. The patch fixes this inconsistency. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Denis Efremov <yefremov.denis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-08gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()Pravin B Shelar
Rather than having logic to calculate inner protocol in every tunnel gso handler move it to gso code. This simplifies code. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <amwang@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-06fib_trie: no need to delay vfree()Al Viro
Now that vfree() can be called from interrupt contexts, there's no need to play games with schedule_work() to escape calling vfree() from RCU callbacks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>