summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-01-08net: xfrm: xfrm_policy: silence compiler warningYing Xue
Fix below compiler warning: net/xfrm/xfrm_policy.c:1644:12: warning: ‘xfrm_dst_alloc_copy’ defined but not used [-Wunused-function] Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07Merge branch 'tipc'David S. Miller
Jon Maloy says: ==================== tipc: link setup and failover improvements This series consists of four unrelated commits with different purposes. - Commit #1 is purely cosmetic and pedagogic, hopefully making the failover/tunneling logics slightly easier to understand. - Commit #2 fixes a bug that has always been in the code, but was not discovered until very recently. - Commit #3 fixes a non-fatal race issue in the neighbour discovery code. - Commit #4 removes an unnecessary indirection step during link startup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07tipc: make link start event synchronousJon Paul Maloy
When a link is created we delay the start event by launching it to be executed later in a tasklet. As we hold all the necessary locks at the moment of creation, and there is no risk of deadlock or contention, this delay serves no purpose in the current code. We remove this obsolete indirection step, and the associated function link_start(). At the same time, we rename the function tipc_link_stop() to the more appropriate tipc_link_purge_queues(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07tipc: introduce new spinlock to protect struct link_reqYing Xue
Currently, only 'bearer_lock' is used to protect struct link_req in the function disc_timeout(). This is unsafe, since the member fields 'num_nodes' and 'timer_intv' might be accessed by below three different threads simultaneously, none of them grabbing bearer_lock in the critical region: link_activate() tipc_bearer_add_dest() tipc_disc_add_dest() req->num_nodes++; tipc_link_reset() tipc_bearer_remove_dest() tipc_disc_remove_dest() req->num_nodes-- disc_update() read req->num_nodes write req->timer_intv disc_timeout() read req->num_nodes read/write req->timer_intv Without lock protection, the only symptom of a race is that discovery messages occasionally may not be sent out. This is not fatal, since such messages are best-effort anyway. On the other hand, since discovery messages are not time critical, adding a protecting lock brings no serious overhead either. So we add a new, dedicated spinlock in order to guarantee absolute data consistency in link_req objects. This also helps reduce the overall role of the bearer_lock, which we want to remove completely in a later commit series. Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07tipc: remove 'has_redundant_link' flag from STATE link protocol messagesJon Paul Maloy
The flag 'has_redundant_link' is defined only in RESET and ACTIVATE protocol messages. Due to an ambiguity in the protocol specification it is currently also transferred in STATE messages. Its value is used to initialize a link state variable, 'permit_changeover', which is used to inhibit futile link failover attempts when it is known that the peer node has no working links at the moment, although the local node may still think it has one. The fact that 'has_redundant_link' incorrectly is read from STATE messages has the effect that 'permit_changeover' sometimes gets a wrong value, and permanently blocks any links from being re-established. Such failures can only occur in in dual-link systems, and are extremely rare. This bug seems to have always been present in the code. Furthermore, since commit b4b5610223f17790419b03eaa962b0e3ecf930d7 ("tipc: Ensure both nodes recognize loss of contact between them"), the 'permit_changeover' field serves no purpose any more. The task of enforcing 'lost contact' cycles at both peer endpoints is now taken by a new mechanism, using the flags WAIT_NODE_DOWN and WAIT_PEER_DOWN in struct tipc_node to abort unnecessary failover attempts. We therefore remove the 'has_redundant_link' flag from STATE messages, as well as the now redundant 'permit_changeover' variable. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07tipc: rename functions related to link failover and improve commentsJon Paul Maloy
The functionality related to link addition and failover is unnecessarily hard to understand and maintain. We try to improve this by renaming some of the functions, at the same time adding or improving the explanatory comments around them. Names such as "tipc_rcv()" etc. also align better with what is used in other networking components. The changes in this commit are purely cosmetic, no functional changes are made. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net: skbuff: const-ify casts in skb_queue_* functionsDaniel Borkmann
We should const-ify comparisons on skb_queue_* inline helper functions as their parameters are const as well, so lets not drop that. Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net: xfrm: xfrm_policy: fix inline not at beginning of declarationDaniel Borkmann
Fix three warnings related to: net/xfrm/xfrm_policy.c:1644:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] net/xfrm/xfrm_policy.c:1656:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] net/xfrm/xfrm_policy.c:1668:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] Just removing the inline keyword is sufficient as the compiler will decide on its own about inlining or not. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07mlx4_en: Select PTP_1588_CLOCKShawn Bohrer
Now that mlx4_en includes a PHC driver it must select PTP_1588_CLOCK. drivers/built-in.o: In function `mlx4_en_get_ts_info': >> en_ethtool.c:(.text+0x391a11): undefined reference to `ptp_clock_index' drivers/built-in.o: In function `mlx4_en_remove_timestamp': >> (.text+0x397913): undefined reference to `ptp_clock_unregister' drivers/built-in.o: In function `mlx4_en_init_timestamp': >> (.text+0x397b20): undefined reference to `ptp_clock_register' Fixes: ad7d4eaed995d ("mlx4_en: Add PTP hardware clock") Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net-gre-gro: Add GRE support to the GRO stackJerry Chu
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net: Do not enable tx-nocache-copy by defaultBenjamin Poirier
There are many cases where this feature does not improve performance or even reduces it. For example, here are the results from tests that I've run using 3.12.6 on one Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are from the Xeon, but they're similar on the i7. All numbers report the mean±stddev over 10 runs of 10s. 1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache copy from user on transmit" There is no statistically significant difference between tx-nocache-copy on/off. nic irqs spread out (one queue per cpu) 200x netperf -r 1400,1 tx-nocache-copy off 692000±1000 tps 50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3 tx-nocache-copy on 693000±1000 tps 50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7 200x netperf -r 14000,14000 tx-nocache-copy off 86450±80 tps 50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40 tx-nocache-copy on 86110±60 tps 50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20 2) single stream throughput tests tx-nocache-copy leads to higher service demand throughput cpu0 cpu1 demand (Gb/s) (Gcycle) (Gcycle) (cycle/B) nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send) tx-nocache-copy off 9402±5 9.4±0.2 0.80±0.01 tx-nocache-copy on 9403±3 9.85±0.04 0.838±0.004 nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send) tx-nocache-copy off 9401±5 5.83±0.03 5.0±0.1 0.923±0.007 tx-nocache-copy on 9404±2 5.74±0.03 5.523±0.009 0.958±0.002 As a second example, here are some results from Eric Dumazet with latest net-next. tx-nocache-copy also leads to higher service demand (cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz) lpq83:~# ./ethtool -K eth0 tx-nocache-copy on lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9407.44 2.50 -1.00 0.522 -1.000 Performance counter stats for './netperf -H lpq84 -c': 4282.648396 task-clock # 0.423 CPUs utilized 9,348 context-switches # 0.002 M/sec 88 CPU-migrations # 0.021 K/sec 355 page-faults # 0.083 K/sec 11,812,797,651 cycles # 2.758 GHz [82.79%] 9,020,522,817 stalled-cycles-frontend # 76.36% frontend cycles idle [82.54%] 4,579,889,681 stalled-cycles-backend # 38.77% backend cycles idle [67.33%] 6,053,172,792 instructions # 0.51 insns per cycle # 1.49 stalled cycles per insn [83.64%] 597,275,583 branches # 139.464 M/sec [83.70%] 8,960,541 branch-misses # 1.50% of all branches [83.65%] 10.128990264 seconds time elapsed lpq83:~# ./ethtool -K eth0 tx-nocache-copy off lpq83:~# perf stat ./netperf -H lpq84 -c MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB 87380 16384 16384 10.00 9412.45 2.15 -1.00 0.449 -1.000 Performance counter stats for './netperf -H lpq84 -c': 2847.375441 task-clock # 0.281 CPUs utilized 11,632 context-switches # 0.004 M/sec 49 CPU-migrations # 0.017 K/sec 354 page-faults # 0.124 K/sec 7,646,889,749 cycles # 2.686 GHz [83.34%] 6,115,050,032 stalled-cycles-frontend # 79.97% frontend cycles idle [83.31%] 1,726,460,071 stalled-cycles-backend # 22.58% backend cycles idle [66.55%] 2,079,702,453 instructions # 0.27 insns per cycle # 2.94 stalled cycles per insn [83.22%] 363,773,213 branches # 127.757 M/sec [83.29%] 4,242,732 branch-misses # 1.17% of all branches [83.51%] 10.128449949 seconds time elapsed CC: Tom Herbert <therbert@google.com> Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07ipv4: loopback device: ignore value changes after device is uppedJiri Pirko
When lo is brought up, new ifa is created. Then, devconf and neigh values bitfield should be set so later changes of default values would not affect lo values. Note that the same behaviour is in ipv6. Also note that this is likely not an issue in many distros (for example Fedora 19) because userspace sets address to lo manually before bringing it up. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07IPv6: add the option to use anycast addresses as source addresses in echo replyFX Le Bail
This change allows to follow a recommandation of RFC4942. - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses as source addresses for ICMPv6 echo reply. This sysctl is false by default to preserve existing behavior. - Add inline check ipv6_anycast_destination(). - Use them in icmpv6_echo_reply(). Reference: RFC4942 - IPv6 Transition/Coexistence Security Considerations (http://tools.ietf.org/html/rfc4942#section-2.1.6) 2.1.6. Anycast Traffic Identification and Security [...] To avoid exposing knowledge about the internal structure of the network, it is recommended that anycast servers now take advantage of the ability to return responses with the anycast address as the source address if possible. Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07net/mlx4_en: fix error return code in mlx4_en_get_qp()Wei Yongjun
Fix to return a negative error code from the error handling case instead of 0. Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07r8152: correct some messagesHayes Wang
- Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn(). - Adjust the algnment of some messages. - Remove the peroid. - Fix some messages don't have terminating newline. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07bna: Fix build due to missing use of dma_unmap_len_set()David S. Miller
> as reported for linux-next of Dec.20, 2013 > when CONFIG_NEED_DMA_MAP_STATE is not enabled: > > drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit': > drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len' Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07gre_offload: statically build GRE offloading supportEric Dumazet
GRO/GSO layers can be enabled on a node, even if said node is only forwarding packets. This patch permits GSO (and upcoming GRO) support for GRE encapsulated packets, even if the host has no GRE tunnel setup. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07bgmac: fix typosHauke Mehrtens
This fixes some typos found by Sergei. Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== [GIT net-next] Open vSwitch Open vSwitch changes for net-next/3.14. Highlights are: * Performance improvements in the mechanism to get packets to userspace using memory mapped netlink and skb zero copy where appropriate. * Per-cpu flow stats in situations where flows are likely to be shared across CPUs. Standard flow stats are used in other situations to save memory and allocation time. * A handful of code cleanups and rationalization. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06ovs: make functions localStephen Hemminger
Several functions and datastructures could be local Found with 'make namespacecheck' Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Compute checksum in skb_gso_segment() if neededThomas Graf
The copy & csum optimization is no longer present with zerocopy enabled. Compute the checksum in skb_gso_segment() directly by dropping the HW CSUM capability from the features passed in. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Use skb_zerocopy() for upcallThomas Graf
Use of skb_zerocopy() can avoid the expensive call to memcpy() when copying the packet data into the Netlink skb. Completes checksum through skb_checksum_help() if not already done in GSO segmentation. Zerocopy is only performed if user space supported unaligned Netlink messages. memory mapped netlink i/o is preferred over zerocopy if it is set up. Cost of upcall is significantly reduced from: + 7.48% vhost-8471 [k] memcpy + 5.57% ovs-vswitchd [k] memcpy + 2.81% vhost-8471 [k] csum_partial_copy_generic to: + 5.72% ovs-vswitchd [k] memcpy + 3.32% vhost-5153 [k] memcpy + 0.68% vhost-5153 [k] skb_zerocopy (megaflows disabled) Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Pass datapath into userspace queue functionsThomas Graf
Allows removing the net and dp_ifindex argument and simplify the code. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Drop user features if old user space attempted to create datapathThomas Graf
Drop user features if an outdated user space instance that does not understand the concept of user_features attempted to create a new datapath. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Allow user space to announce ability to accept unaligned ↵Thomas Graf
Netlink messages Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06net: Export skb_zerocopy() to zerocopy from one skb to anotherThomas Graf
Make the skb zerocopy logic written for nfnetlink queue available for use by other modules. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: remove duplicated include from flow_table.cWei Yongjun
Remove duplicated include. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06net: ovs: use kfree_rcu instead of rcu_free_{sw_flow_mask_cb,acts_callback}Daniel Borkmann
As we're only doing a kfree() anyway in the RCU callback, we can simply use kfree_rcu, which does the same job, and remove the function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Per cpu flow stats.Pravin B Shelar
With mega flow implementation ovs flow can be shared between multiple CPUs which makes stats updates highly contended operation. This patch uses per-CPU stats in cases where a flow is likely to be shared (if there is a wildcard in the 5-tuple and therefore likely to be spread by RSS). In other situations, it uses the current strategy, saving memory and allocation time. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Enable memory mapped Netlink i/oThomas Graf
Use memory mapped Netlink i/o for all unicast openvswitch communication if a ring has been set up. Benchmark * pktgen -> ovs internal port * 5M pkts, 5M flows * 4 threads, 8 cores Before: Result: OK: 67418743(c67108212+d310530) usec, 5000000 (9000byte,0frags) 74163pps 5339Mb/sec (5339736000bps) errors: 0 + 2.98% ovs-vswitchd [k] copy_user_generic_string + 2.49% ovs-vswitchd [k] memcpy + 1.84% kpktgend_2 [k] memcpy + 1.81% kpktgend_1 [k] memcpy + 1.81% kpktgend_3 [k] memcpy + 1.78% kpktgend_0 [k] memcpy After: Result: OK: 24229690(c24127165+d102524) usec, 5000000 (9000byte,0frags) 206358pps 14857Mb/sec (14857776000bps) errors: 0 + 2.80% ovs-vswitchd [k] memcpy + 1.31% kpktgend_2 [k] memcpy + 1.23% kpktgend_0 [k] memcpy + 1.09% kpktgend_1 [k] memcpy + 1.04% kpktgend_3 [k] memcpy + 0.96% ovs-vswitchd [k] copy_user_generic_string Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06netlink: Avoid netlink mmap alloc if msg size exceeds frame sizeThomas Graf
An insufficent ring frame size configuration can lead to an unnecessary skb allocation for every Netlink message. Check frame size before taking the queue lock and allocating the skb and re-check with lock to be safe. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06genl: Add genlmsg_new_unicast() for unicast message allocationThomas Graf
Allocates a new sk_buff large enough to cover the specified payload plus required Netlink headers. Will check receiving socket for memory mapped i/o capability and use it if enabled. Will fall back to non-mapped skb if message size exceeds the frame size of the ring. Signed-of-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Silence RCU lockdep checks from flow lookup.Jesse Gross
Flow lookup can happen either in packet processing context or userspace context but it was annotated as requiring RCU read lock to be held. This also allows OVS mutex to be held without causing warnings. Reported-by: Justin Pettit <jpettit@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Reviewed-by: Thomas Graf <tgraf@redhat.com>
2014-01-06openvswitch: Change ovs_flow_tbl_lookup_xx() APIsAndy Zhou
API changes only for code readability. No functional chnages. This patch removes the underscored version. Added a new API ovs_flow_tbl_lookup_stats() that returns the n_mask_hits. Reported by: Ben Pfaff <blp@nicira.com> Reviewed-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Shrink sw_flow_mask by 8 bytes (64-bit) or 4 bytes (32-bit).Ben Pfaff
We won't normally have a ton of flow masks but using a size_t to store values no bigger than sizeof(struct sw_flow_key) seems excessive. This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit systems. On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but sw_flow_mask only by 8 bytes due to padding. Compile tested only. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06openvswitch: Correct comment.Ben Pfaff
Signed-off-by: Ben Pfaff <blp@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06net_sched: act: action flushing missaccountingJamal Hadi Salim
action flushing missaccounting Account only for deleted actions Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06net_sched: Remove unnecessary checks for act->opsJamal Hadi Salim
Remove unnecessary checks for act->ops (suggested by Eric Dumazet). Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06bridge: use DEVICE_ATTR_xx macrossfeldma@cumulusnetworks.com
Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions. Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06bridge: use spin_lock_bh() in br_multicast_set_hash_maxCurt Brune
br_multicast_set_hash_max() is called from process context in net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function. br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock), which can deadlock the CPU if a softirq that also tries to take the same lock interrupts br_multicast_set_hash_max() while the lock is held . This can happen quite easily when any of the bridge multicast timers expire, which try to take the same lock. The fix here is to use spin_lock_bh(), preventing other softirqs from executing on this CPU. Steps to reproduce: 1. Create a bridge with several interfaces (I used 4). 2. Set the "multicast query interval" to a low number, like 2. 3. Enable the bridge as a multicast querier. 4. Repeatedly set the bridge hash_max parameter via sysfs. # brctl addbr br0 # brctl addif br0 eth1 eth2 eth3 eth4 # brctl setmcqi br0 2 # brctl setmcquerier br0 1 # while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done Signed-off-by: Curt Brune <curt@cumulusnetworks.com> Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06vxlan: keep original skb ownershipEric Dumazet
Sathya Perla posted a patch trying to address following problem : <quote> The vxlan driver sets itself as the socket owner for all the TX flows it encapsulates (using vxlan_set_owner()) and assigns it's own skb destructor. This causes all tunneled traffic to land up on only one TXQ as all encapsulated skbs refer to the vxlan socket and not the original socket. Also, the vxlan skb destructor breaks some functionality for tunneled traffic like wmem accounting and as TCP small queues and FQ/pacing packet scheduler. </quote> I reworked Sathya patch and added some explanations. vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair and gain better drop monitor accuracy, by calling kfree_skb() when appropriate. The UDP socket used by vxlan to perform encapsulation of xmit packets do not need to be alive while packets leave vxlan code. Its better to keep original socket ownership to get proper feedback from qdisc and NIC layers. We use skb->sk to A) control amount of bytes/packets queued on behalf of a socket, but prior vxlan code did the skb->sk transfert without any limit/control on vxlan socket sk_sndbuf. B) security purposes (as selinux) or netfilter uses, and I do not think anything is prepared to handle vxlan stacked case in this area. By not changing ownership, vxlan tunnels behave like other tunnels. As Stephen mentioned, we might do the same change in L2TP. Reported-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06tcp: out_of_order_queue do not use its lockEric Dumazet
TCP out_of_order_queue lock is not used, as queue manipulation happens with socket lock held and we therefore use the lockless skb queue routines (as __skb_queue_head()) We can use __skb_queue_head_init() instead of skb_queue_head_init() to make this more consistent. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06ipv6: don't install anycast address for /128 addresses on routersHannes Frederic Sowa
It does not make sense to create an anycast address for an /128-prefix. Suppress it. As 32019e651c6fce ("ipv6: Do not leave router anycast address for /127 prefixes.") shows we also may not leave them, because we could accidentally remove an anycast address the user has allocated or got added via another prefix. Cc: François-Xavier Le Bail <fx.lebail@yahoo.com> Cc: Thomas Haller <thaller@redhat.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06hso: fix handling of modem port SERIAL_STATE notificationsDan Williams
The existing serial state notification handling expected older Option devices, having a hardcoded assumption that the Modem port was always USB interface #2. That isn't true for devices from the past few years. hso_serial_state_notification is a local cache of a USB Communications Interface Class SERIAL_STATE notification from the device, and the USB CDC specification (section 6.3, table 67 "Class-Specific Notifications") defines wIndex as the USB interface the event applies to. For hso devices this will always be the Modem port, as the Modem port is the only port which is set up to receive them by the driver. So instead of always expecting USB interface #2, instead validate the notification with the actual USB interface number of the Modem port. Signed-off-by: Dan Williams <dcbw@redhat.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06bonding: fix kstrtou8() return value verification in num_peer_notifVeaceslav Falico
It returns 0 in case of success, !0 error otherwise. Fix the improper error verification. Fixes: 2c9839c143bbc ("bonding: add num_grat_arp attribute netlink support") CC: sfeldma@cumulusnetworks.com CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06r8152: replace the return value of rtl_ops_inithayeswang
Replace the boolean value with the error code for the return value of the rtl_ops_init(). Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06r8152: move the actions of saving the information of the devicehayeswang
Some information of the device may be used in other functions. Move the relative code to make sure it would be initialzed correctly before using it. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06r8152: replace some tabs with spaceshayeswang
Replace the tabs of the variables declaration with the spaces. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06isdn: Drop big endian cpp checks from telespci and hfc_pci driversGuenter Roeck
With arm:allmodconfig, building the Teles PCI driver fails with telespci.c:294:2: error: #error "not running on big endian machines now" Similar, building the driver for HFC PCI-Bus cards fails with hfc_pci.c:1647:2: error: #error "not running on big endian machines now" Remove the big endian cpp check from both drivers to fix the build errors. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>