summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2013-06-11net_sched: add 64bit rate estimatorsEric Dumazet
struct gnet_stats_rate_est contains u32 fields, so the bytes per second field can wrap at 34360Mbit. Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields, and switch the kernel to use this structure natively. This structure is dumped to user space as a new attribute : TCA_STATS_RATE_EST64 Old tc command will now display the capped bps (to 34360Mbit), instead of wrapped values, and updated tc command will display correct information. Old tc command output, after patch : eric:~# tc -s -d qd sh dev lo qdisc pfifo 8001: root refcnt 2 limit 1000p Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0) rate 34360Mbit 189696pps backlog 0b 0p requeues 0 This patch carefully reorganizes "struct Qdisc" layout to get optimal performance on SMP. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11net: pass correct parameter to skb_headers_offset_update()Peter Pan(潘卫平)
Since commit 1a37e412a022(net: Use 16bits for *_headers fields of struct skbuff), skb->*_header are relative to skb->head, so copy_skb_header() should not call skb_headers_offset_update() now, and we should pass correct parameter to skb_headers_offset_update() in pskb_expand_head() and skb_copy_expand(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11netlink: Add compare function for netlink_tableGao feng
As we know, netlink sockets are private resource of net namespace, they can communicate with each other only when they in the same net namespace. this works well until we try to add namespace support for other subsystems which use netlink. Don't like ipv4 and route table.., it is not suited to make these subsytems belong to net namespace, Such as audit and crypto subsystems,they are more suitable to user namespace. So we must have the ability to make the netlink sockets in same user namespace can communicate with each other. This patch adds a new function pointer "compare" for netlink_table, we can decide if the netlink sockets can communicate with each other through this netlink_table self-defined compare function. The behavior isn't changed if we don't provide the compare function for netlink_table. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11bridge: Add a flag to control unicast packet flood.Vlad Yasevich
Add a flag to control flood of unicast traffic. By default, flood is on and the bridge will flood unicast traffic if it doesn't know the destination. When the flag is turned off, unicast traffic without an FDB will not be forwarded to the specified port. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11bridge: Add flag to control mac learning.Vlad Yasevich
Allow user to control whether mac learning is enabled on the port. By default, mac learning is enabled. Disabling mac learning will cause new dynamic FDB entries to not be created for a particular port. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11net: remove last caller of skb_tail_offset() and itselfCong Wang
Similar to the following commits: commit 00f97da17a0c8d656d0c9 (netpoll: fix position of network header) commit 525cebedb32a87fa48584 (pktgen: Fix position of ip and udp header) using skb_tail_offset() seems not correct since the offset is based on head pointer. With the last caller removed, skb_tail_offset() can be killed finally. Cc: Thomas Graf <tgraf@suug.ch> Cc: Daniel Borkmann <dborkmann@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11tcp: add low latency socket poll support.Eliezer Tamir
Adds low latency socket poll support for TCP. In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id from the skb to the sk. In tcp_recvmsg(), when there is no data in the socket we busy-poll. This is a good example of how to add busy-poll support to more protocols. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11udp: add low latency socket poll supportEliezer Tamir
Add upport for busy-polling on UDP sockets. In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id from the skb into the sk. This is done at the earliest possible moment, right after we identify which socket this skb is for. In __skb_recv_datagram When there is no data and the user tries to read we busy poll. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11net: add low latency socket pollEliezer Tamir
Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11net: add napi_id and hashEliezer Tamir
Adds a napi_id and a hashing mechanism to lookup a napi by id. This will be used by subsequent patches to implement low latency Ethernet device polling. Based on a code sample by Eric Dumazet. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07netlink: allow large data transfers from user-spacePablo Neira Ayuso
I can hit ENOBUFS in the sendmsg() path with a large batch that is composed of many netlink messages. Here that limit is 8 MBytes of skbuff data area as kmalloc does not manage to get more than that. While discussing atomic rule-set for nftables with Patrick McHardy, we decided to put all rule-set updates that need to be applied atomically in one single batch to simplify the existing approach. However, as explained above, the existing netlink code limits us to a maximum of ~20000 rules that fit in one single batch without hitting ENOBUFS. iptables does not have such limitation as it is using vmalloc. This patch adds netlink_alloc_large_skb() which is only used in the netlink_sendmsg() path. It uses alloc_skb if the memory requested is <= one memory page, that should be the common case for most subsystems, else vmalloc for higher memory allocations. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07net: tcp: move GRO/GSO functions to tcp_offloadDaniel Borkmann
Would be good to make things explicit and move those functions to a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c. While moving all related functions into tcp_offload.c, we can also make some of them static, since they are only used there. Also, add an explicit registration function. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07net: minor: tcp: use tcp_skb_mss helper in tcp_tso_segmentDaniel Borkmann
We have the minimal inline helper tcp_skb_mss to access skb_shinfo(skb)->gso_size, so also use it here to get mss. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge 'net' into 'net-next' to get the MSG_CMSG_COMPAT regression fix. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06net: Unbreak compat_sys_{send,recv}msgAndy Lutomirski
I broke them in this commit: commit 1be374a0518a288147c6a7398792583200a67261 Author: Andy Lutomirski <luto@amacapital.net> Date: Wed May 22 14:07:44 2013 -0700 net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg This patch adds __sys_sendmsg and __sys_sendmsg as common helpers that accept MSG_CMSG_COMPAT and blocks MSG_CMSG_COMPAT at the syscall entrypoints. It also reverts some unnecessary checks in sys_socketcall. Apparently I was suffering from underscore blindness the first time around. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Conflicts: net/netfilter/nf_log.c The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS protection around foo_proc_entry() calls to fix a build failure, whereas in Pablo's tree a guard if() test around a call is remove_proc_entry() was removed. Trivially resolved. Pablo Neira Ayuso says: ==================== The following patchset contains the first batch of Netfilter/IPVS updates for your net-next tree, they are: * Three patches with improvements and code refactorization for nfnetlink_queue, from Florian Westphal. * FTP helper now parses replies without brackets, as RFC1123 recommends, from Jeff Mahoney. * Rise a warning to tell everyone about ULOG deprecation, NFLOG has been already in the kernel tree for long time and supersedes the old logging over netlink stub, from myself. * Don't panic if we fail to load netfilter core framework, just bail out instead, from myself. * Add cond_resched_rcu, used by IPVS to allow rescheduling while walking over big hashtables, from Simon Horman. * Change type of IPVS sysctl_sync_qlen_max sysctl to avoid possible overflow, from Zhang Yanfei. * Use strlcpy instead of strncpy to skip zeroing of already initialized area to write the extension names in ebtables, from Chen Gang. * Use already existing per-cpu notrack object from xt_CT, from Eric Dumazet. * Save explicit socket lookup in xt_socket now that we have early demux, also from Eric Dumazet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge 'net' bug fixes into 'net-next' as we have patches that will build on top of them. This merge commit includes a change from Emil Goode (emilgoode@gmail.com) that fixes a warning that would have been introduced by this merge. Specifically it fixes the pingv6_ops method ipv6_chk_addr() to add a "const" to the "struct net_device *dev" argument and likewise update the dummy_ipv6_chk_addr() declaration. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05netfilter: nfnetlink_queue: only add CAP_LEN attr when neededFlorian Westphal
CAP_LEN contains the size of the network packet we're queueing to userspace, i.e. normally it is the same as the NFQA_PAYLOAD attribute len. Include it only in the unlikely case when NFQA_PAYLOAD is truncated due to copy_range limitations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-06-05netfilter: nfnetlink_queue: cleanup copy_range usageFlorian Westphal
For every packet queued, we check if configured copy_range is 0, and treat that as 'copy entire packet'. We can move this check to the queue configuration, and can set copy_range appropriately. Also, convert repetitive '0xffff - NLA_HDRLEN' to a macro. [ queue initialization still used 0xffff, although its harmless since the initial setting is overwritten on queue config ] Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-06-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix timeouts with direct mode authentication in mac80211, from Stanislaw Gruszka. 2) Aggregation sessions can deadlock in ath9k, from Felix Fietkau. 3) Netfilter's xt_addrtype doesn't work with ipv6 due to route lookups creating undesirable cache entries, from Florian Westphal. 4) Fix netfilter's ipt_ULOG from generating non-NULL terminated strings. 5) Fix netdev transmit queue crashes in mac80211, from Johannes Berg. 6) Fix copy and paste error in 802.11 stack that broke reporting of 64-bit station tx statistics, from Felix Fietkau. 7) When qlge_probe fails, it leaks the netdev. Fix from Wei Yongjun. 8) SKB control block (where we store the IP options information, amongst other things) must be cleared properly otherwise ICMP sending can crash for IP tunnels. Fix from Eric Dumazet. 9) Verification of Energy Efficient Ether support was coded wrongly, the test was inversed. Fix from Giuseppe CAVALLARO. 10) TCP handles redirects improperly because the wrong flow key is used for the route lookup. From Michal Kubecek. 11) Don't interpret MSG_CMSG_COMPAT from userspace, fix from Andy Lutomirski. 12) The new AF_VSOCK was missing from the lockdep string table, fix from Federico Vaga. 13) be2net doesn't handle checksumming of IP fragments properly, from Somnath Kotur. 14) Fix several bugs in the device address list code that lead to crashes and other misbehaviors. From Jay Vosburgh. 15) Fix ipv6 segmentation handling of fragmented GRE tunnel traffic, from Pravin B Shalr. 16) Fix usage of stale policies in IPSEC layer, from Paul Moore. 17) Fix team driver dump of ports when there are a large number of them, from Jiri Pirko. 18) Fix softlockups in UDP ipv4 socket lookup causes by and error in the hlist_nulls_for_each_entry_rcu() macro. From Eric Dumazet. 19) Fix several regressions added by the high rate accuracy changes to the htb packet scheduler. From Eric Dumazet. 20) Fix DMA'ing onto the stack in esd_usb2 and peak_usb CAN drivers, from Olivier Sobrie and Marc Kleine-Budde. 21) Fix unremovable network devices due to missing route pointer installation in the per-device ipv6 address list entries. From Gao feng. 22) Apply the tg3 5719 DMA workaround on 5720 chips as well, otherwise we get stalls. From Nithin Sujir. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (68 commits) net_sched: htb: do not mix 1ns and 64ns time units net: fix sk_buff head without data area tg3: Add read dma workaround for 5720 net: ethernet: xilinx_emaclite: set protocol selector bits when writing ANAR bnx2x: Fix bridged GSO for 57710/57711 chips net: fec: add fallback to random MAC address bnx2x: fix TCP offload for tunneling ipv4 over ipv6 ipv6: assign rt6_info to inet6_ifaddr in init_loopback net/mlx4_core: Keep VF assigned MAC in the PF admin table net/mlx4_en: Handle unassigned VF MAC address correctly net/mlx4_core: Return -EPROBE_DEFER when a VF is probed before PF is sufficiently initialized net/mlx4_en: Fix adaptive moderation cq update net: can: peak_usb: Do not do dma on the stack net: can: esd_usb2: Do not do dma on the stack net: can: kvaser_usb: fix reception on "USBcan Pro" and "USBcan R" type hardware. net_sched: restore "overhead xxx" handling net: force a reload of first item in hlist_nulls_for_each_entry_rcu hyperv: Fix vlan_proto setting in netvsc_recv_callback() team: fix port list dump for big number of ports list: introduce list_first_entry_or_null ...
2013-06-05net_sched: htb: do not mix 1ns and 64ns time unitsEric Dumazet
commit 56b765b79 ("htb: improved accuracy at high rates") added another regression for low rates, because it mixes 1ns and 64ns time units. So the maximum delay (mbuffer) was not 60 second, but 937 ms. Lets convert all time fields to 1ns as 64bit arches are becoming the norm. Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05netpoll: fix position of network headerAmerigo Wang
Similar to the problem in pktgen, netpoll uses skb_tail_offset() too, as the code is copied from pktgen. Also use return values of skb_put() directly, this will simiplify the code. Reported-by: Thomas Graf <tgraf@suug.ch> Cc: Thomas Graf <tgraf@suug.ch> Cc: Daniel Borkmann <dborkmann@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05pktgen: Fix position of ip and udp headerThomas Graf
skb_set_network_header() expects an offset based on the data pointer whereas skb_tail_offset() also includes the headroom. This resulted in the ip header being written in a wrong location. Use return values of skb_put() directly and rely on skb->len to set mac, network, and transport header. Cc: Simon Horman <horms@verge.net.au> Cc: Daniel Borkmann <dborkmann@redhat.com> Assisted-by: Daniel Borkmann <dborkmann@redhat.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <dborkmann@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05net: fix sk_buff head without data areaPablo Neira
Eric Dumazet spotted that we have to check skb->head instead of skb->data as skb->head points to the beginning of the data area of the skbuff. Similarly, we have to initialize the skb->head pointer, not skb->data in __alloc_skb_head. After this fix, netlink crashes in the release path of the sk_buff, so let's fix that as well. This bug was introduced in (0ebd0ac net: add function to allocate sk_buff head without data area). Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05net/ethtool: Fix comment regarding location of dev_ethtool() callYan Burman
Signed-off-by: Yan Burman <yanb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04ping: always initialize ->sin6_scope_id and ->sin6_flowinfoCong Wang
If we don't need scope id, we should initialize it to zero. Same for ->sin6_flowinfo. Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04ipv6: assign rt6_info to inet6_ifaddr in init_loopbackGao feng
Commit 25fb6ca4ed9cad72f14f61629b68dc03c0d9713f "net IPv6 : Fix broken IPv6 routing table after loopback down-up" forgot to assign rt6_info to the inet6_ifaddr. When disable the net device, the rt6_info which allocated in init_loopback will not be destroied in __ipv6_ifa_notify. This will trigger the waring message below [23527.916091] unregister_netdevice: waiting for tap0 to become free. Usage count = 1 Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: mark netdev_create_hash __net_initBaruch Siach
netdev_create_hash() is only called from netdev_init() which is marked __net_init. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04Kconfig: remove dangling references to the deleted fileJean Sacren
Commit 202dc3fc599c1dded235d3b448d9ca924252e354 (Documentation: remove obsolete networking/multicast.txt file) deleted the obsolete file. After the file has been removed, clean up a couple of places where references to the deleted file were made so that users wouldn't be confused when they consult the Help menu. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04xfrm: simplify the exit path of xfrm_output_one()Jean Sacren
Clean up unnecessary assignment and jump. While there, fix up the label name. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: ipv6: Implement /proc/net/icmp6.Lorenzo Colitti
The format is based on /proc/net/icmp and /proc/net/{udp,raw}6. Compiles and displays reasonable results with CONFIG_IPV6={n,m,y} Couldn't figure out how to test without CONFIG_PROC_FS enabled. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: ipv4: make the ping /proc code AF-independentLorenzo Colitti
Introduce a ping_seq_afinfo structure (similar to its UDP equivalent) and use it to make some of the ping /proc functions address-family independent. Rename the remaining ping /proc functions from ping_* to ping_v4_*. Compiles and displays reasonable results with CONFIG_IPV6={n,m,y} Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: ipv6: Unify {raw,udp}6_sock_seq_show.Lorenzo Colitti
udp6_sock_seq_show and raw6_sock_seq_show are identical, except the UDP version displays ports and the raw version displays the protocol. Refactor most of the code in these two functions into a new common ip6_dgram_sock_seq_show function, in preparation for using it to display ICMPv6 sockets as well. Also reduce the indentation in parts of include/net/transp_v6.h to improve readability. Compiles and displays reasonable results with CONFIG_IPV6={n,m,y} Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03icmp: avoid allocating large struct on stackCong Wang
struct icmp_bxm is a large struct, reduce stack usage by allocating it on heap. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03] icmp: fix icmp_unreach() comment.Rami Rosen
ICMP_PARAMETERPROB is handled by icmp_unreach(); This patch adds ICMP_PARAMETERPROB to the list of ICMP message types handled by icmp_unreach(). Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: use separate genid for next hop exceptionsTimo Teräs
commit 13d82bf5 (ipv4: Fix flushing of cached routing informations) added the support to flush learned pmtu information. However, using rt_genid is quite heavy as it is bumped on route add/change and multicast events amongst other places. These can happen quite often, especially if using dynamic routing protocols. While this is ok with routes (as they are just recreated locally), the pmtu information is learned from remote systems and the icmp notification can come with long delays. It is worthy to have separate genid to avoid excessive pmtu resets. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: rate limit updating of next hop exceptions with same pmtuTimo Teräs
The tunnel devices call update_pmtu for each packet sent, this causes contention on the fnhe_lock. Ignore the pmtu update if pmtu is not actually changed, and there is still plenty of time before the entry expires. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03ipv4: properly refresh rtable entries on pmtu/redirect eventsTimo Teräs
This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on IPsec pmtu events). Flushing all cached entries is not needed. Instead, invalidate only the related next hop dsts to recheck for the added next hop exception where needed. This also fixes a subtle race due to bumping generation id's before updating the pmtu. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03net_sched: restore "overhead xxx" handlingEric Dumazet
commit 56b765b79 ("htb: improved accuracy at high rates") broke the "overhead xxx" handling, as well as the "linklayer atm" attribute. tc class add ... htb rate X ceil Y linklayer atm overhead 10 This patch restores the "overhead xxx" handling, for htb, tbf and act_police The "linklayer atm" thing needs a separate fix. Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vimalkumar <j.vimal@gmail.com> Cc: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01xfrm: force a garbage collection after deleting a policyPaul Moore
In some cases after deleting a policy from the SPD the policy would remain in the dst/flow/route cache for an extended period of time which caused problems for SELinux as its dynamic network access controls key off of the number of XFRM policy and state entries. This patch corrects this problem by forcing a XFRM garbage collection whenever a policy is sucessfully removed. Reported-by: Ondrej Moris <omoris@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01sit: add IPv4 over IPv4 supportNicolas Dichtel
This patch adds the support of IPv4 over Ipv4 for the module sit. The gain of this feature is to be able to have 4in4 and 6in4 over the same interface instead of having one interface for 6in4 and another for 4in4 even if encapsulation addresses are the same. To avoid conflicting with ipip module, sit IPv4 over IPv4 protocol is registered with a smaller priority. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01iptunnel: specify protocol outside IP headerNicolas Dichtel
Before this patch, ip_tunnel_xmit() was using the field protocol from the IP header passed into argument. There is no functional change, this patch prepares the support of IPv4 over IPv4 for module sit. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01udp6: Fix udp fragmentation for tunnel traffic.Pravin B Shelar
udp6 over GRE tunnel does not work after to GRE tso changes. GRE tso handler passes inner packet but keeps track of outer header start in SKB_GSO_CB(skb)->mac_offset. udp6 fragment need to take care of outer header, which start at the mac_offset, while adding fragment header. This bug is introduced by commit 68c3316311 (GRE: Add TCP segmentation offload for GRE). Reported-by: Dmitry Kravkov <dkravkov@gmail.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Tested-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-01net: clean up skb headers codeCong Wang
commit 1a37e412a0225fcba5587 (net: Use 16bits for *_headers fields of struct skbuff) converts skb->*_header to u16, some #if NET_SKBUFF_DATA_USES_OFFSET are now useless, and to be safe, we could just use "X = (typeof(X)) ~0U;" as suggested by David. Cc: David S. Miller <davem@davemloft.net> Cc: Simon Horman <horms@verge.net.au> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31net/core: dev_mc_sync_multiple calls wrong helperJay Vosburgh
The dev_mc_sync_multiple function is currently calling __hw_addr_sync, and not __hw_addr_sync_multiple. This will result in addresses only being synced to the first device from the set. Corrected by calling the _multiple variant. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31net/core: __hw_addr_sync_one / _multiple brokenJay Vosburgh
Currently, __hw_addr_sync_one is called in a loop by __hw_addr_sync_multiple to sync each of a "from" device's hw addresses to a "to" device. __hw_addr_sync_one calls __hw_addr_add_ex to attempt to add each address. __hw_addr_add_ex is called with global=false, and sync=true. __hw_addr_add_ex checks to see if the new address matches an address already on the list. If so, it tests global and sync. In this case, sync=true, and it then checks if the address is already synced, and if so, returns 0. This 0 return causes __hw_addr_sync_one to increment the sync_cnt and refcount for the "from" list's address entry, even though the address is already synced and has a reference and sync_cnt. This will cause the sync_cnt and refcount to increment without bound every time an addresses is added to the "from" device and synced to the "to" device. The fix here has two parts: First, when __hw_addr_add_ex finds the address already exists and is synced, return -EEXIST instead of 0. Second, __hw_addr_sync_one checks the error return for -EEXIST, and if so, it (a) does not add a refcount/sync_cnt, and (b) returns 0 itself so that __hw_addr_sync_multiple will not return an error. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31net/core: __hw_addr_unsync_one "from" address not marked syncedJay Vosburgh
When an address is added to a subordinate interface (the "to" list), the address entry in the "from" list is not marked "synced" as the entry added to the "to" list is. When performing the unsync operation (e.g., dev_mc_unsync), __hw_addr_unsync_one calls __hw_addr_del_entry with the "synced" parameter set to true for the case when the address reference is being released from the "from" list. This causes a test inside to fail, with the result being that the reference count on the "from" address is not properly decremeted and the address on the "from" list will never be freed. Correct this by having __hw_addr_unsync_one call the __hw_addr_del_entry function with the "sync" flag set to false for the "remove from the from list" case. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31net/core: __hw_addr_create_ex does not initialize sync_cntJay Vosburgh
The sync_cnt field is not being initialized, which can result in arbitrary values in the field. Fixed by initializing it to zero. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Reviewed-by: Vlad Yasevich <vyasevic@redhat.com> Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31netfilter: Correct calculation using skb->tail and skb-network_headerSimon Horman
This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer whereas skb->network_header will be an offset from head. This is corrected by using wrappers that ensure that calculations are always made using pointers. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31snmp6: remove IPSTATS_MIB_CSUMERRORSNicolas Dichtel
This stat is not relevant in IPv6, there is no checksum in IPv6 header. Just leave a comment to explain the hole. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>