summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2013-08-30pkt_sched: fq: prefetch() fixEric Dumazet
kbuild bot reported following m68k build error : net/sched/sch_fq.c: In function 'fq_dequeue': >> net/sched/sch_fq.c:491:2: error: implicit declaration of function 'prefetch' [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors While we are fixing this, move this prefetch() call a bit earlier. Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30SUNRPC: Add a framework to clean up management of rpc_pipefs directoriesTrond Myklebust
The current system requires everyone to set up notifiers, manage directory locking, etc. What we really want to do is have the rpc_client create its directory, and then create all the entries. This patch will allow the RPCSEC_GSS and NFS code to register all the objects that they want to have appear in the directory, and then have the sunrpc code call them back to actually create/destroy their pipefs dentries when the rpc_client creates/destroys the parent. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30RPCSEC_GSS: Fix an Oopsable condition when creating/destroying pipefs objectsTrond Myklebust
If an error condition occurs on rpc_pipefs creation, or the user mounts rpc_pipefs and then unmounts it, then the dentries in struct gss_auth need to be reset to NULL so that a second call to gss_pipes_dentries_destroy doesn't try to free them again. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30RPCSEC_GSS: Further cleanupsTrond Myklebust
Don't pass the rpc_client as a parameter, when what we really want is the net namespace. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30SUNRPC: Replace clnt->cl_principalTrond Myklebust
The clnt->cl_principal is being used exclusively to store the service target name for RPCSEC_GSS/krb5 callbacks. Replace it with something that is stored only in the RPCSEC_GSS-specific code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30RPCSEC_GSS: Clean up upcall message allocationTrond Myklebust
Optimise away gss_encode_msg: we don't need to look up the pipe version a second time. Save the gss target name in struct gss_auth. It is a property of the auth cache itself, and doesn't really belong in the rpc_client. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30SUNRPC: Cleanup rpc_setup_pipedirTrond Myklebust
The directory name is _always_ clnt->cl_program->pipe_dir_name. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30SUNRPC: Remove unused struct rpc_clnt field cl_protnameTrond Myklebust
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30SUNRPC: Deprecate rpc_client->cl_protnameTrond Myklebust
It just duplicates the cl_program->name, and is not used in any fast paths where the extra dereference will cause a hit. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30pkt_sched: fq: Fair Queue packet schedulerEric Dumazet
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel) - Uses the new_flow/old_flow separation from FQ_codel - New flows get an initial credit allowing IW10 without added delay. - Special FIFO queue for high prio packets (no need for PRIO + FQ) - Uses a hash table of RB trees to locate the flows at enqueue() time - Smart on demand gc (at enqueue() time, RB tree lookup evicts old unused flows) - Dynamic memory allocations. - Designed to allow millions of concurrent flows per Qdisc. - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow. - Single high resolution timer for throttled flows (if any). - One RB tree to link throttled flows. - Ability to have a max rate per flow. We might add a socket option to add per socket limitation. Attempts have been made to add TCP pacing in TCP stack, but this seems to add complex code to an already complex stack. TCP pacing is welcomed for flows having idle times, as the cwnd permits TCP stack to queue a possibly large number of packets. This removes the 'slow start after idle' choice, hitting badly large BDP flows, and applications delivering chunks of data as video streams. Nicely spaced packets : Here interface is 10Gbit, but flow bottleneck is ~20Mbit cwin is big, yet FQ avoids the typical bursts generated by TCP (as in netperf TCP_RR -- -r 100000,100000) 15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115> 15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115> 15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115> 15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115> 15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115> 15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115> 15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115> 15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> TCP timestamps show that most packets from B were queued in the same ms timeframe (TSval 1159799{3,4}), but FQ managed to send them right in time to avoid a big burst. In slow start or steady state, very few packets are throttled [1] FQ gets a bunch of tunables as : limit : max number of packets on whole Qdisc (default 10000) flow_limit : max number of packets per flow (default 100) quantum : the credit per RR round (default is 2 MTU) initial_quantum : initial credit for new flows (default is 10 MTU) maxrate : max per flow rate (default : unlimited) buckets : number of RB trees (default : 1024) in hash table. (consumes 8 bytes per bucket) [no]pacing : disable/enable pacing (default is enable) All of them can be changed on a live qdisc. $ tc qd add dev eth0 root fq help Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ] [ quantum BYTES ] [ initial_quantum BYTES ] [ maxrate RATE ] [ buckets NUMBER ] [ [no]pacing ] $ tc -s -d qd qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14) backlog 0b 0p requeues 14 511 flows, 511 inactive, 0 throttled 110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit [1] Except if initial srtt is overestimated, as if using cached srtt in tcp metrics. We'll provide a fix for this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29can: gw: add a per rule limitation of frame hopsOliver Hartkopp
Usually the received CAN frames can be processed/routed as much as 'max_hops' times (which is given at module load time of the can-gw module). Introduce a new configuration option to reduce the number of possible hops for a specific gateway rule to a value smaller then max_hops. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2013-08-29net: packet: use reciprocal_divide in fanout_demux_hashDaniel Borkmann
Instead of hard-coding reciprocal_divide function, use the inline function from reciprocal_div.h. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: packet: add randomized fanout schedulerDaniel Borkmann
We currently allow for different fanout scheduling policies in pf_packet such as scheduling by skb's rxhash, round-robin, by cpu, and rollover. Also allow for a random, equidistributed selection of the socket from the fanout process group. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add netdev_upper_get_next_dev_rcu(dev, iter)Veaceslav Falico
This function returns the next dev in the dev->upper_dev_list after the struct list_head **iter position, and updates *iter accordingly. Returns NULL if there are no devices left. Caller must hold RCU read lock. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: remove search_list from netdev_adjacentVeaceslav Falico
We already don't need it cause we see every upper/lower device in the list already. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add lower_dev_list to net_device and make a full meshVeaceslav Falico
This patch adds lower_dev_list list_head to net_device, which is the same as upper_dev_list, only for lower devices, and begins to use it in the same way as the upper list. It also changes the way the whole adjacent device lists work - now they contain *all* of upper/lower devices, not only the first level. The first level devices are distinguished by the bool neighbour field in netdev_adjacent, also added by this patch. There are cases when a device can be added several times to the adjacent list, the simplest would be: /---- eth0.10 ---\ eth0- --- bond0 \---- eth0.20 ---/ where both bond0 and eth0 'see' each other in the adjacent lists two times. To avoid duplication of netdev_adjacent structures ref_nr is being kept as the number of times the device was added to the list. The 'full view' is achieved by adding, on link creation, all of the upper_dev's upper_dev_list devices as upper devices to all of the lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice versa. On unlink they are removed using the same logic. I've tested it with thousands vlans/bonds/bridges, everything works ok and no observable lags even on a huge number of interfaces. Memory footprint for 128 devices interconnected with each other via both upper and lower (which is impossible, but for the comparison) lists would be: 128*128*2*sizeof(netdev_adjacent) = 1.5MB but in the real world we usualy have at most several devices with slaves and a lot of vlans, so the footprint will be much lower. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: rename netdev_upper to netdev_adjacentVeaceslav Falico
Rename the structure to reflect the upcoming addition of lower_dev_list. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== This pull request fixes some issues that arise when 6in4 or 4in6 tunnels are used in combination with IPsec, all from Hannes Frederic Sowa and a null pointer dereference when queueing packets to the policy hold queue. 1) We might access the local error handler of the wrong address family if 6in4 or 4in6 tunnel is protected by ipsec. Fix this by addind a pointer to the correct local_error to xfrm_state_afinet. 2) Add a helper function to always refer to the correct interpretation of skb->sk. 3) Call skb_reset_inner_headers to record the position of the inner headers when adding a new one in various ipv6 tunnels. This is needed to identify the addresses where to send back errors in the xfrm layer. 4) Dereference inner ipv6 header if encapsulated to always call the right error handler. 5) Choose protocol family by skb protocol to not call the wrong xfrm{4,6}_local_error handler in case an ipv6 sockets is used in ipv4 mode. 6) Partly revert "xfrm: introduce helper for safe determination of mtu" because this introduced pmtu discovery problems. 7) Set skb->protocol on tcp, raw and ip6_append_data genereated skbs. We need this to get the correct mtu informations in xfrm. 8) Fix null pointer dereference in xdst_queue_output. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29ipv6: Don't depend on per socket memory for neighbour discovery messagesThomas Graf
Allocating skbs when sending out neighbour discovery messages currently uses sock_alloc_send_skb() based on a per net namespace socket and thus share a socket wmem buffer space. If a netdevice is temporarily unable to transmit due to carrier loss or for other reasons, the queued up ndisc messages will cosnume all of the wmem space and will thus prevent from any more skbs to be allocated even for netdevices that are able to transmit packets. The number of neighbour discovery messages sent is very limited, simply use alloc_skb() and don't depend on any socket wmem space any longer. This patch has orginally been posted by Eric Dumazet in a modified form. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29ipv4: sendto/hdrincl: don't use destination address found in headerChris Clark
ipv4: raw_sendmsg: don't use header's destination address A sendto() regression was bisected and found to start with commit f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.) The problem is that it tries to ARP-lookup the constructed packet's destination address rather than the explicitly provided address. Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used. cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b Reported-by: Chris Clark <chris.clark@alcatel-lucent.com> Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com> Tested-by: Chris Clark <chris.clark@alcatel-lucent.com> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: sctp: sctp_verify_init: clean up mandatory checks and add commentDaniel Borkmann
Add a comment related to RFC4960 explaning why we do not check for initial TSN, and while at it, remove yoda notation checks and clean up code from checks of mandatory conditions. That's probably just really minor, but makes reviewing easier. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: TSO packets automatic sizingEric Dumazet
After hearing many people over past years complaining against TSO being bursty or even buggy, we are proud to present automatic sizing of TSO packets. One part of the problem is that tcp_tso_should_defer() uses an heuristic relying on upcoming ACKS instead of a timer, but more generally, having big TSO packets makes little sense for low rates, as it tends to create micro bursts on the network, and general consensus is to reduce the buffering amount. This patch introduces a per socket sk_pacing_rate, that approximates the current sending rate, and allows us to size the TSO packets so that we try to send one packet every ms. This field could be set by other transports. Patch has no impact for high speed flows, where having large TSO packets makes sense to reach line rate. For other flows, this helps better packet scheduling and ACK clocking. This patch increases performance of TCP flows in lossy environments. A new sysctl (tcp_min_tso_segs) is added, to specify the minimal size of a TSO packet (default being 2). A follow-up patch will provide a new packet scheduler (FQ), using sk_pacing_rate as an input to perform optional per flow pacing. This explains why we chose to set sk_pacing_rate to twice the current rate, allowing 'slow start' ramp up. sk_pacing_rate = 2 * cwnd * mss / srtt v2: Neal Cardwell reported a suspect deferring of last two segments on initial write of 10 MSS, I had to change tcp_tso_should_defer() to take into account tp->xmit_size_goal_segs Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Cc: Tom Herbert <therbert@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29ipv6: drop fragmented ndisc packets by default (RFC 6980)Hannes Frederic Sowa
This patch implements RFC6980: Drop fragmented ndisc packets by default. If a fragmented ndisc packet is received the user is informed that it is possible to disable the check. Cc: Fernando Gont <fernando@gont.com.ar> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29bridge: inherit slave devices needed_headroomFlorian Fainelli
Some slave devices may have set a dev->needed_headroom value which is different than the default one, most likely in order to prepend a hardware descriptor in front of the Ethernet frame to send. Whenever a new slave is added to a bridge, ensure that we update the needed_headroom value accordingly to account for the slave needed_headroom value. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: don't apply tsoffset if rcv_tsecr is zeroAndrew Vagin
The zero value means that tsecr is not valid, so it's a special case. tsoffset is used to customize tcp_time_stamp for one socket. tsoffset is usually zero, it's used when a socket was moved from one host to another host. Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to TCP_RTO_MAX. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29tcp: initialize rcv_tstamp for restored socketsAndrew Vagin
u32 rcv_tstamp; /* timestamp of last received ACK */ Its value used in tcp_retransmit_timer, which closes socket if the last ack was received more then TCP_RTO_MAX ago. Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer is called before receiving a first ack, the connection is closed. This patch initializes rcv_tstamp to a timestamp, when a socket was restored. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: drivers/net/wireless/iwlwifi/pcie/trans.c
2013-08-29sysfs: Restrict mounting sysfsEric W. Biederman
Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights over the net namespace. The principle here is if you create or have capabilities over it you can mount it, otherwise you get to live with what other people have mounted. Instead of testing this with a straight forward ns_capable call, perform this check the long and torturous way with kobject helpers, this keeps direct knowledge of namespaces out of sysfs, and preserves the existing sysfs abstractions. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-28genl: Hold reference on correct module while netlink-dump.Pravin B Shelar
netlink dump operations take module as parameter to hold reference for entire netlink dump duration. Currently it holds ref only on genl module which is not correct when we use ops registered to genl from another module. Following patch adds module pointer to genl_ops so that netlink can hold ref count on it. CC: Jesse Gross <jesse@nicira.com> CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28genl: Fix genl dumpit() locking.Pravin B Shelar
In case of genl-family with parallel ops off, dumpif() callback is expected to run under genl_lock, But commit def3117493eafd9df (genl: Allow concurrent genl callbacks.) changed this behaviour where only first dumpit() op was called under genl-lock. For subsequent dump, only nlk->cb_lock was taken. Following patch fixes it by defining locked dumpit() and done() callback which takes care of genl-locking. CC: Jesse Gross <jesse@nicira.com> CC: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28SUNRPC: Fix memory corruption issue on 32-bit highmem systemsTrond Myklebust
Some architectures, such as ARM-32 do not return the same base address when you call kmap_atomic() twice on the same page. This causes problems for the memmove() call in the XDR helper routine "_shift_data_right_pages()", since it defeats the detection of overlapping memory ranges, and has been seen to corrupt memory. The fix is to distinguish between the case where we're doing an inter-page copy or not. In the former case of we know that the memory ranges cannot possibly overlap, so we can additionally micro-optimise by replacing memmove() with memcpy(). Reported-by: Mark Young <MYoung@nvidia.com> Reported-by: Matt Craighead <mcraighead@nvidia.com> Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Matt Craighead <mcraighead@nvidia.com>
2013-08-28Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
2013-08-28Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
2013-08-28Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless Conflicts: drivers/net/wireless/iwlwifi/pcie/trans.c net/mac80211/ibss.c
2013-08-28batman-adv: send GW_DEL event when the gw client mode is deselectedAntonio Quartulli
Whenever the GW client mode is deselected, a DEL event has to be sent in order to tell userspace that the current gateway has been lost. Send the uevent on state change only if a gateway was currently selected. Reported-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
2013-08-28batman-adv: Start new development cycleSimon Wunderlich
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2013-08-28batman-adv: move enum definition at the top of the fileAntonio Quartulli
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2013-08-28batman-adv: set skb priority according to contentSimon Wunderlich
The skb priority field may help the wireless driver to choose the right queue (e.g. WMM queues). This should be set in batman-adv, as this information is only available here. This patch adds support for IPv4/IPv6 DS fields and VLAN PCP. Note that only VLAN PCP is used if a VLAN header is present. Also initially set TC_PRIO_CONTROL only for self-generated packets, and keep the priority set by higher layers. Signed-off-by: Simon Wunderlich <simon@open-mesh.com> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2013-08-28xfrm: Fix potential null pointer dereference in xdst_queue_outputSteffen Klassert
The net_device might be not set on the skb when we try refcounting. This leads to a null pointer dereference in xdst_queue_output(). It turned out that the refcount to the net_device is not needed after all. The dst_entry has a refcount to the net_device before we queue the skb, so it can't go away. Therefore we can remove the refcount on queueing to fix the null pointer dereference. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-08-28Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== A number of significant new features and optimizations for net-next/3.12. Highlights are: * "Megaflows", an optimization that allows userspace to specify which flow fields were used to compute the results of the flow lookup. This allows for a major reduction in flow setups (the major performance bottleneck in Open vSwitch) without reducing flexibility. * Converting netlink dump operations to use RCU, allowing for additional parallelism in userspace. * Matching and modifying SCTP protocol fields. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27netfilter: ctnetlink: fix uninitialized variableFlorian Westphal
net/netfilter/nf_conntrack_netlink.c: In function 'ctnetlink_nfqueue_attach_expect': 'helper' may be used uninitialized in this function It was only initialized in if CTA_EXPECT_HELP_NAME attribute was present, it must be NULL otherwise. Problem added recently in bd077937 (netfilter: nfnetlink_queue: allow to attach expectations to conntracks). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27netfilter: add IPv6 SYNPROXY targetPatrick McHardy
Add an IPv6 version of the SYNPROXY target. The main differences to the IPv4 version is routing and IP header construction. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27net: syncookies: export cookie_v6_init_sequence/cookie_v6_checkPatrick McHardy
Extract the local TCP stack independant parts of tcp_v6_init_sequence() and cookie_v6_check() and export them for use by the upcoming IPv6 SYNPROXY target. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27netfilter: add SYNPROXY core/targetPatrick McHardy
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy core with common functions and an address family specific target. The SYNPROXY receives the connection request from the client, responds with a SYN/ACK containing a SYN cookie and announcing a zero window and checks whether the final ACK from the client contains a valid cookie. It then establishes a connection to the original destination and, if successful, sends a window update to the client with the window size announced by the server. Support for timestamps, SACK, window scaling and MSS options can be statically configured as target parameters if the features of the server are known. If timestamps are used, the timestamp value sent back to the client in the SYN/ACK will be different from the real timestamp of the server. In order to now break PAWS, the timestamps are translated in the direction server->client. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27net: syncookies: export cookie_v4_init_sequence/cookie_v4_checkPatrick McHardy
Extract the local TCP stack independant parts of tcp_v4_init_sequence() and cookie_v4_check() and export them for use by the upcoming SYNPROXY target. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27netfilter: nf_conntrack: make sequence number adjustments usuable without NATPatrick McHardy
Split out sequence number adjustments from NAT and move them to the conntrack core to make them usable for SYN proxying. The sequence number adjustment information is moved to a seperate extend. The extend is added to new conntracks when a NAT mapping is set up for a connection using a helper. As a side effect, this saves 24 bytes per connection with NAT in the common case that a connection does not have a helper assigned. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27netfilter: nf_defrag_ipv6.o included twiceNathan Hintz
'nf_defrag_ipv6' is built as a separate module; it shouldn't be included in the 'nf_conntrack_ipv6' module as well. Signed-off-by: Nathan Hintz <nlhintz@hotmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridgedPhil Oester
As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT with the tcp-reset option sends out reset packets with the src MAC address of the local bridge interface, instead of the MAC address of the intended destination. This causes some routers/firewalls to drop the reset packet as it appears to be spoofed. Fix this by bypassing ip[6]_local_out and setting the MAC of the sender in the tcp reset packet. This closes netfilter bugzilla #531. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-27openvswitch: optimize flow compare and mask functionsAndy Zhou
Make sure the sw_flow_key structure and valid mask boundaries are always machine word aligned. Optimize the flow compare and mask operations using machine word size operations. This patch improves throughput on average by 15% when CPU is the bottleneck of forwarding packets. This patch is inspired by ideas and code from a patch submitted by Peter Klausler titled "replace memcmp() with specialized comparator". However, The original patch only optimizes for architectures support unaligned machine word access. This patch optimizes for all architectures. Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-27Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless John W. Linville says: ==================== This is one more set of fixes intended for the 3.11 stream... For the mac80211 bits, Johannes says: "I have three more patches for the 3.11 stream: Felix's fix for the fairly visible brcmsmac crash, a fix from Simon for an IBSS join bug I found and a fix for a channel context bug in IBSS I'd introduced." Along with those... Sujith Manoharan makes a minor change to not use a PLL hang workaroun for AR9550. This one-liner fixes a couple of bugs reported in the Red Hat bugzilla. Helmut Schaa addresses an ath9k_htc bug that mangles frame headers during Tx. This fix is small, tested by the bug reported and isolated to ath9k_htc. Stanislaw Gruszka reverts a recent iwl4965 change that broke rfkill notification to user space. Please let me know if there are problems! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>