summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2012-05-08netfilter: remove ip_queue supportPablo Neira Ayuso
This patch removes ip_queue support which was marked as obsolete years ago. The nfnetlink_queue modules provides more advanced user-space packet queueing mechanism. This patch also removes capability code included in SELinux that refers to ip_queue. Otherwise, we break compilation. Several warning has been sent regarding this to the mailing list in the past month without anyone rising the hand to stop this with some strong argument. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_conntrack: fix explicit helper attachment and NATPablo Neira Ayuso
Explicit helper attachment via the CT target is broken with NAT if non-standard ports are used. This problem was hidden behind the automatic helper assignment routine. Thus, it becomes more noticeable now that we can disable the automatic helper assignment with Eric Leblond's: 9e8ac5a netfilter: nf_ct_helper: allow to disable automatic helper assignment Basically, nf_conntrack_alter_reply asks for looking up the helper up if NAT is enabled. Unfortunately, we don't have the conntrack template at that point anymore. Since we don't want to rely on the automatic helper assignment, we can skip the second look-up and stick to the helper that was attached by iptables. With the CT target, the user is in full control of helper attachment, thus, the policy is to trust what the user explicitly configures via iptables (no automatic magic anymore). Interestingly, this bug was hidden by the automatic helper look-up code. But it can be easily trigger if you attach the helper in a non-standard port, eg. iptables -I PREROUTING -t raw -p tcp --dport 8888 \ -j CT --helper ftp And you disabled the automatic helper assignment. I added the IPS_HELPER_BIT that allows us to differenciate between a helper that has been explicitly attached and those that have been automatically assigned. I didn't come up with a better solution (having backward compatibility in mind). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_ct_expect: partially implement ctnetlink_change_expectKelvie Wong
This refreshes the "timeout" attribute in existing expectations if one is given. The use case for this would be for userspace helpers to extend the lifetime of the expectation when requested, as this is not possible right now without deleting/recreating the expectation. I use this specifically for forwarding DCERPC traffic through: DCERPC has a port mapper daemon that chooses a (seemingly) random port for future traffic to go to. We expect this traffic (with a reasonable timeout), but sometimes the port mapper will tell the client to continue using the same port. This allows us to extend the expectation accordingly. Signed-off-by: Kelvie Wong <kelvie@ieee.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08net: export sysctl_[r|w]mem_max symbols needed by ip_vs_syncHans Schillstrom
To build ip_vs as a module sysctl_rmem_max and sysctl_wmem_max needs to be exported. The dependency was added by "ipvs: wakeup master thread" patch. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Simon Horman <horms@verge.net.au> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08ipvs: ip_vs_proto: local functions should not be exposed globallyH Hartley Sweeten
Functions not referenced outside of a source file should be marked static to prevent it from being exposed globally. This quiets the sparse warnings: warning: symbol '__ipvs_proto_data_get' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: ip_vs_ftp: local functions should not be exposed globallyH Hartley Sweeten
Functions not referenced outside of a source file should be marked static to prevent it from being exposed globally. This quiets the sparse warnings: warning: symbol 'ip_vs_ftp_init' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: optimize the use of flags in ip_vs_bind_destPablo Neira Ayuso
cp->flags is marked volatile but ip_vs_bind_dest can safely modify the flags, so save some CPU cycles by using temp variable. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: add support for sync threadsPablo Neira Ayuso
Allow master and backup servers to use many threads for sync traffic. Add sysctl var "sync_ports" to define the number of threads. Every thread will use single UDP port, thread 0 will use the default port 8848 while last thread will use port 8848+sync_ports-1. The sync traffic for connections is scheduled to many master threads based on the cp address but one connection is always assigned to same thread to avoid reordering of the sync messages. Remove ip_vs_sync_switch_mode because this check for sync mode change is still risky. Instead, check for mode change under sync_buff_lock. Make sure the backup socks do not block on reading. Special thanks to Aleksey Chudov for helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: reduce sync rate with time thresholdsJulian Anastasov
Add two new sysctl vars to control the sync rate with the main idea to reduce the rate for connection templates because currently it depends on the packet rate for controlled connections. This mechanism should be useful also for normal connections with high traffic. sync_refresh_period: in seconds, difference in reported connection timer that triggers new sync message. It can be used to avoid sync messages for the specified period (or half of the connection timeout if it is lower) if connection state is not changed from last sync. sync_retries: integer, 0..3, defines sync retries with period of sync_refresh_period/8. Useful to protect against loss of sync messages. Allow sysctl_sync_threshold to be used with sysctl_sync_period=0, so that only single sync message is sent if sync_refresh_period is also 0. Add new field "sync_endtime" in connection structure to hold the reported time when connection expires. The 2 lowest bits will represent the retry count. As the sysctl_sync_period now can be 0 use ACCESS_ONCE to avoid division by zero. Special thanks to Aleksey Chudov for being patient with me, for his extensive reports and helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: wakeup master threadPablo Neira Ayuso
High rate of sync messages in master can lead to overflowing the socket buffer and dropping the messages. Fixed sleep of 1 second without wakeup events is not suitable for loaded masters, Use delayed_work to schedule sending for queued messages and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will reduce the rate of wakeups but to avoid sending long bursts we wakeup the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages. Add hard limit for the queued messages before sending by using "sync_qlen_max" sysctl var. It defaults to 1/32 of the memory pages but actually represents number of messages. It will protect us from allocating large parts of memory when the sending rate is lower than the queuing rate. As suggested by Pablo, add new sysctl var "sync_sock_size" to configure the SNDBUF (master) or RCVBUF (slave) socket limit. Default value is 0 (preserve system defaults). Change the master thread to detect and block on SNDBUF overflow, so that we do not drop messages when the socket limit is low but the sync_qlen_max limit is not reached. On ENOBUFS or other errors just drop the messages. Change master thread to enter TASK_INTERRUPTIBLE state early, so that we do not miss wakeups due to messages or kthread_should_stop event. Thanks to Pablo Neira Ayuso for his valuable feedback! Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: always update some of the flags bits in backupJulian Anastasov
As the goal is to mirror the inactconns/activeconns counters in the backup server, make sure the cp->flags are updated even if cp is still not bound to dest. If cp->flags are not updated ip_vs_bind_dest will rely only on the initial flags when updating the counters. To avoid mistakes and complicated checks for protocol state rely only on the IP_VS_CONN_F_INACTIVE bit when updating the counters. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: fix ip_vs_try_bind_dest to rebind app and transmitterJulian Anastasov
Initially, when the synced connection is created we use the forwarding method provided by master but once we bind to destination it can be changed. As result, we must update the application and the transmitter. As ip_vs_try_bind_dest is called always for connections that require dest binding, there is no need to validate the cp and dest pointers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_destJulian Anastasov
As the IP_VS_CONN_F_INACTIVE bit is properly set in cp->flags for all kind of connections we do not need to add special checks for synced connections when updating the activeconns/inactconns counters for first time. Now logic will look just like in ip_vs_unbind_dest. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup serverJulian Anastasov
As IP_VS_CONN_F_NOOUTPUT is derived from the forwarding method we should get it from conn_flags just like we do it for IP_VS_CONN_F_FWD_MASK bits when binding to real server. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: use GFP_KERNEL allocation where possibleSasha Levin
Use GFP_KERNEL instead of GFP_ATOMIC when registering an ipvs protocol. This is safe since it will always run from a process context. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08ipvs: SH scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: LBLCR scheduler does not need GFP_ATOMIC allocation on initJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: WRR scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: DH scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on initJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: timeout tables do not need GFP_ATOMIC allocationJulian Anastasov
They are called only on initialization. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08netfilter: bridge: optionally set indev to vlanPablo Neira Ayuso
if net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge netfilter removes the vlan header temporarily and then feeds the packet to ip(6)tables. When the new "bridge-nf-pass-vlan-input-device" sysctl is on (default off), then bridge netfilter will also set the in-interface to the vlan interface; if such an interface exists. This is needed to make iptables REDIRECT target work with "vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to match the vlan device name. Also update Documentation with current brnf default settings. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_ct_helper: allow to disable automatic helper assignmentEric Leblond
This patch allows you to disable automatic conntrack helper lookup based on TCP/UDP ports, eg. echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper [ Note: flows that already got a helper will keep using it even if automatic helper assignment has been disabled ] Once this behaviour has been disabled, you have to explicitly use the iptables CT target to attach helper to flows. There are good reasons to stop supporting automatic helper assignment, for further information, please read: http://www.netfilter.org/news.html#2012-04-03 This patch also adds one message to inform that automatic helper assignment is deprecated and it will be removed soon (this is spotted only once, with the first flow that gets a helper attached to make it as less annoying as possible). Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_ct_ecache: refactor notifier registrationTony Zelenoff
* ret variable initialization removed as useless * similar code strings concatenated and functions code flow became more plain Signed-off-by: Tony Zelenoff <antonz@parallels.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/e1000e/param.c drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h Resolved the iwlwifi conflict with mainline using 3-way diff posted by John Linville and Stephen Rothwell. In 'net' we added a bug fix to make iwlwifi report a more accurate skb->truesize but this conflicted with RX path changes that happened meanwhile in net-next. In e1000e a conflict arose in the validation code for settings of adapter->itr. 'net-next' had more sophisticated logic so that logic was used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-08net: IP_MULTICAST_IF setsockopt now recognizes struct mreqJiri Pirko
Until now, struct mreq has not been recognized and it was worked with as with struct in_addr. That means imr_multiaddr was copied to imr_address. So do recognize struct mreq here and copy that correctly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-06skb: Add inline helper for getting the skb end offset from headAlexander Duyck
With the recent changes for how we compute the skb truesize it occurs to me we are probably going to have a lot of calls to skb_end_pointer - skb->head. Instead of running all over the place doing that it would make more sense to just make it a separate inline skb_end_offset(skb) that way we can return the correct value without having gcc having to do all the optimization to cancel out skb->head - skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-06skb: Drop "fastpath" variable for skb_cloned check in pskb_expand_headAlexander Duyck
Since there is now only one spot that actually uses "fastpath" there isn't much point in carrying it. Instead we can just use a check for skb_cloned to verify if we can perform the fast-path free for the head or not. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-06skb: Drop bad code from pskb_expand_headAlexander Duyck
The fast-path for pskb_expand_head contains a check where the size plus the unaligned size of skb_shared_info is compared against the size of the data buffer. This code path has two issues. First is the fact that after the recent changes by Eric Dumazet to __alloc_skb and build_skb the shared info is always placed in the optimal spot for a buffer size making this check unnecessary. The second issue is the fact that the check doesn't take into account the aligned size of shared info. As a result the code burns cycles doing a memcpy with nothing actually being shifted. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-04tcp: be more strict before accepting ECN negociationEric Dumazet
It appears some networks play bad games with the two bits reserved for ECN. This can trigger false congestion notifications and very slow transferts. Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can disable TCP ECN negociation if it happens we receive mangled CT bits in the SYN packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Perry Lorier <perryl@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Wilmer van der Gaast <wilmer@google.com> Cc: Ankur Jain <jankur@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Dave Täht <dave.taht@bufferbloat.net> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-04net: sched: factorize code (qdisc_drop())Eric Dumazet
Use qdisc_drop() helper where possible. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Transfer padding was wrong for full-speed USB in ASIX driver, fix from Ingo van Lil. 2) Propagate the negative packet offset fix into the PowerPC BPF JIT. From Jan Seiffert. 3) dl2k driver's private ioctls were letting unprivileged tasks make MII writes and other ugly bits like that. Fix from Jeff Mahoney. 4) Fix TX VLAN and RX packet drops in ucc_geth, from Joakim Tjernlund. 5) OOPS and network namespace fixes in IPVS from Hans Schillstrom and Julian Anastasov. 6) Fix races and sleeping in locked context bugs in drop_monitor, from Neil Horman. 7) Fix link status indication in smsc95xx driver, from Paolo Pisati. 8) Fix bridge netfilter OOPS, from Peter Huang. 9) L2TP sendmsg can return on error conditions with the socket lock held, oops. Fix from Sasha Levin. 10) udp_diag should return meaningful values for socket memory usage, from Shan Wei. 11) Eric Dumazet is so awesome he gets his own section: Socket memory cgroup code (I never should have applied those patches, grumble...) made erroneous changes to sk_sockets_allocated_read_positive(). It was changed to use percpu_counter_sum_positive (which requires BH disabling) instead of percpu_counter_read_positive (which does not). Revert back to avoid crashes and lockdep warnings. Adjust the default tcp_adv_win_scale and tcp_rmem[2] values to fix throughput regressions. This is necessary as a result of our more precise skb->truesize tracking. Fix SKB leak in netem packet scheduler. 12) New device IDs for various bluetooth devices, from Manoj Iyer, AceLan Kao, and Steven Harms. 13) Fix command completion race in ipw2200, from Stanislav Yakovlev. 14) Fix rtlwifi oops on unload, from Larry Finger. 15) Fix hard_mtu when adjusting hard_header_len in smsc95xx driver. From Stephane Fillod. 16) ehea driver registers it's IRQ before all the necessary state is setup, resulting in crashes. Fix from Thadeu Lima de Souza Cascardo. 17) Fix PHY connection failures in davinci_emac driver, from Anatolij Gustschin. 18) Missing break; in switch statement in bluetooth's hci_cmd_complete_evt(). Fix from Szymon Janc. 19) Fix queue programming in iwlwifi, from Johannes Berg. 20) Interrupt throttling defaults not being actually programmed into the hardware, fix from Jeff Kirsher and Ying Cai. 21) TLAN driver SKB encoding in descriptor busted on 64-bit, fix from Benjamin Poirier. 22) Fix blind status block RX producer pointer deref in TG3 driver, from Matt Carlson. 23) Promisc and multicast are busted on ehea, fixes from Thadeu Lima de Souza Cascardo. 24) Fix crashes in 6lowpan, from Alexander Smirnov. 25) tcp_complete_cwr() needs to be careful to not rewind the CWND to ssthresh if ssthresh has the "infinite" value. Fix from Yuchung Cheng. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits) sungem: Fix WakeOnLan tcp: change tcp_adv_win_scale and tcp_rmem[2] net: l2tp: unlock socket lock before returning from l2tp_ip_sendmsg drop_monitor: prevent init path from scheduling on the wrong cpu usbnet: fix failure handling in usbnet_probe usbnet: fix leak of transfer buffer of dev->interrupt ucc_geth: Add 16 bytes to max TX frame for VLANs net: ucc_geth, increase no. of HW RX descriptors netem: fix possible skb leak sky2: fix receive length error in mixed non-VLAN/VLAN traffic sky2: propogate rx hash when packet is copied net: fix two typos in skbuff.h cxgb3: Don't call cxgb_vlan_mode until q locks are initialized ixgbe: fix calling skb_put on nonlinear skb assertion bug ixgbe: Fix a memory leak in IEEE DCB igbvf: fix the bug when initializing the igbvf smsc75xx: enable mac to detect speed/duplex from phy smsc75xx: declare smsc75xx's MII as GMII capable smsc75xx: fix phy interrupt acknowledge smsc75xx: fix phy init reset loop ...
2012-05-03skb: Add skb_head_is_locked helper functionAlexander Duyck
This patch adds support for a skb_head_is_locked helper function. It is meant to be used any time we are considering transferring the head from skb->head to a paged frag. If the head is locked it means we cannot remove the head from the skb so it must be copied or we must take the skb as a whole. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: Fix truesize accounting in skb_gro_receive()Eric Dumazet
GRO is very optimistic in skb truesize estimates, only taking into account the used part of fragments. Be conservative, and use more precise computation, so that bloated GRO skbs can be collapsed eventually. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: move stats merge to the end of tcp_try_coalesceAlexander Duyck
This change cleans up the last bits of tcp_try_coalesce so that we only need one goto which jumps to the end of the function. The idea is to make the code more readable by putting things in a linear order so that we start execution at the top of the function, and end it at the bottom. I also made a slight tweak to the code for handling frags when we are a clone. Instead of making it an if (clone) loop else nr_frags = 0 I changed the logic so that if (!clone) we just set the number of frags to 0 which disables the for loop anyway. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: Move code related to head frag in tcp_try_coalesceAlexander Duyck
This change reorders the code related to the use of an skb->head_frag so it is placed before we check the rest of the frags. This allows the code to read more linearly instead of like some sort of loop. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: Fix truesize accounting in tcp_try_coalesceAlexander Duyck
This patch addresses several issues in the way we were tracking the truesize in tcp_try_coalesce. First it was using ksize which prevents us from having a 0 sized head frag and getting a usable result. To resolve that this patch uses the end pointer which is set based off either ksize, or the frag_size supplied in build_skb. This allows us to compute the original truesize of the entire buffer and remove that value leaving us with just what was added as pages. The second issue was the use of skb->len if there is a mergeable head frag. We should only need to remove the size of an data aligned sk_buff from our current skb->truesize to compute the delta for a buffer with a reused head. By using skb->len the value of truesize was being artificially reduced which means that head frags could use more memory than buffers using standard allocations. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: Add missing linux/prefetch.h include to net/core/sock.cDavid S. Miller
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: Stop decapitating clones that have a head_fragAlexander Duyck
This change is meant ot prevent stealing the skb->head to use as a page in the event that the skb->head was cloned. This allows the other clones to track each other via shinfo->dataref. Without this we break down to two methods for tracking the reference count, one being dataref, the other being the page count. As a result it becomes difficult to track how many references there are to skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: implement tcp coalescing in tcp_queue_rcv()Eric Dumazet
Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: take care of cloned skbs in tcp_try_coalesce()Eric Dumazet
Before stealing fragments or skb head, we must make sure skbs are not cloned. Alexander was worried about destination skb being cloned : In bridge setups, a driver could be fooled if skb->data_len would not match skb nr_frags. If source skb is cloned, we must take references on pages instead. Bug happened using tcpdump (if not using mmap()) Introduce kfree_skb_partial() helper to cleanup code. Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: change tcp_adv_win_scale and tcp_rmem[2]Eric Dumazet
tcp_adv_win_scale default value is 2, meaning we expect a good citizen skb to have skb->len / skb->truesize ratio of 75% (3/4) In 2.6 kernels we (mis)accounted for typical MSS=1460 frame : 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392. So these skbs were considered as not bloated. With recent truesize fixes, a typical MSS=1460 frame truesize is now the more precise : 2048 + 256 = 2304. But 2304 * 3/4 = 1728. So these skb are not good citizen anymore, because 1460 < 1728 (GRO can escape this problem because it build skbs with a too low truesize.) This also means tcp advertises a too optimistic window for a given allocated rcvspace : When receiving frames, sk_rmem_alloc can hit sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often, especially when application is slow to drain its receive queue or in case of losses (netperf is fast, scp is slow). This is a major latency source. We should adjust the len/truesize ratio to 50% instead of 75% This patch : 1) changes tcp_adv_win_scale default to 1 instead of 2 2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account better truesize tracking and to allow autotuning tcp receive window to reach same value than before. Note that same amount of kernel memory is consumed compared to 2.6 kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03net: l2tp: unlock socket lock before returning from l2tp_ip_sendmsgSasha Levin
l2tp_ip_sendmsg could return without releasing socket lock, making it all the way to userspace, and generating the following warning: [ 130.891594] ================================================ [ 130.894569] [ BUG: lock held when returning to user space! ] [ 130.897257] 3.4.0-rc5-next-20120501-sasha #104 Tainted: G W [ 130.900336] ------------------------------------------------ [ 130.902996] trinity/8384 is leaving the kernel with locks still held! [ 130.906106] 1 lock held by trinity/8384: [ 130.907924] #0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff82b9503f>] l2tp_ip_sendmsg+0x2f/0x550 Introduced by commit 2f16270 ("l2tp: Fix locking in l2tp_ip.c"). Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03drop_monitor: prevent init path from scheduling on the wrong cpuNeil Horman
I just noticed after some recent updates, that the init path for the drop monitor protocol has a minor error. drop monitor maintains a per cpu structure, that gets initalized from a single cpu. Normally this is fine, as the protocol isn't in use yet, but I recently made a change that causes a failed skb allocation to reschedule itself . Given the current code, the implication is that this workqueue reschedule will take place on the wrong cpu. If drop monitor is used early during the boot process, its possible that two cpus will access a single per-cpu structure in parallel, possibly leading to data corruption. This patch fixes the situation, by storing the cpu number that a given instance of this per-cpu data should be accessed from. In the case of a need for a reschedule, the cpu stored in the struct is assigned the rescheule, rather than the currently executing cpu Tested successfully by myself. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: early retransmit: delayed fast retransmitYuchung Cheng
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: early retransmitYuchung Cheng
This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-03tcp: early retransmit: tcp_enter_recovery()Yuchung Cheng
This a prepartion patch that refactors the code to enter recovery into a new function tcp_enter_recovery(). It's needed to implement the delayed fast retransmit in ER. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
2012-05-01netem: fix possible skb leakEric Dumazet
skb_checksum_help(skb) can return an error, we must free skb in this case. qdisc_drop(skb, sch) can also be feeded with a NULL skb (if skb_unshare() failed), so lets use this generic helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01netem: add ECN capabilityEric Dumazet
Add ECN (Explicit Congestion Notification) marking capability to netem tc qdisc add dev eth0 root netem drop 0.5 ecn Instead of dropping packets, try to ECN mark them. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>