summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2014-04-10net-netif_rx_ni-migrate-disable.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10softirq-thread-do-softirq.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10net-flip-lock-dep-thingy.patchThomas Gleixner
======================================================= [ INFO: possible circular locking dependency detected ] 3.0.0-rc3+ #26 ------------------------------------------------------- ip/1104 is trying to acquire lock: (local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68 but task is already holding lock: (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_INET){+.+...}: [<ffffffff810836e5>] lock_acquire+0x103/0x12e [<ffffffff813e2781>] lock_sock_nested+0x82/0x92 [<ffffffff81433308>] lock_sock+0x10/0x12 [<ffffffff81433afa>] tcp_close+0x1b/0x355 [<ffffffff81453c99>] inet_release+0xc3/0xcd [<ffffffff813dff3f>] sock_release+0x1f/0x74 [<ffffffff813dffbb>] sock_close+0x27/0x2b [<ffffffff81129c63>] fput+0x11d/0x1e3 [<ffffffff81126577>] filp_close+0x70/0x7b [<ffffffff8112667a>] sys_close+0xf8/0x13d [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b -> #0 (local_softirq_lock){+.+...}: [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8 [<ffffffff810836e5>] lock_acquire+0x103/0x12e [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a [<ffffffff81056d12>] __local_lock+0x25/0x68 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f [<ffffffff81433c38>] tcp_close+0x159/0x355 [<ffffffff81453c99>] inet_release+0xc3/0xcd [<ffffffff813dff3f>] sock_release+0x1f/0x74 [<ffffffff813dffbb>] sock_close+0x27/0x2b [<ffffffff81129c63>] fput+0x11d/0x1e3 [<ffffffff81126577>] filp_close+0x70/0x7b [<ffffffff8112667a>] sys_close+0xf8/0x13d [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_INET); lock(local_softirq_lock); lock(sk_lock-AF_INET); lock(local_softirq_lock); *** DEADLOCK *** 1 lock held by ip/1104: #0: (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12 stack backtrace: Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ #26 Call Trace: [<ffffffff81081649>] print_circular_bug+0x1f8/0x209 [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8 [<ffffffff81056d12>] ? __local_lock+0x25/0x68 [<ffffffff810836e5>] lock_acquire+0x103/0x12e [<ffffffff81056d12>] ? __local_lock+0x25/0x68 [<ffffffff81046c75>] ? get_parent_ip+0x11/0x41 [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a [<ffffffff81056d12>] ? __local_lock+0x25/0x68 [<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41 [<ffffffff81056d12>] __local_lock+0x25/0x68 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b [<ffffffff81433308>] ? lock_sock+0x10/0x12 [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f [<ffffffff81433c38>] tcp_close+0x159/0x355 [<ffffffff81453c99>] inet_release+0xc3/0xcd [<ffffffff813dff3f>] sock_release+0x1f/0x74 [<ffffffff813dffbb>] sock_close+0x27/0x2b [<ffffffff81129c63>] fput+0x11d/0x1e3 [<ffffffff81126577>] filp_close+0x70/0x7b [<ffffffff8112667a>] sys_close+0xf8/0x13d [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-13neigh: recompute reachabletime before returning from neigh_periodic_work()Duan Jiong
[ Upstream commit feff9ab2e7fa773b6a3965f77375fe89f7fd85cf ] If the neigh table's entries is less than gc_thresh1, the function will return directly, and the reachabletime will not be recompute, so the reachabletime can be guessed. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26net: use __GFP_NORETRY for high order allocationsEric Dumazet
[ Upstream commit ed98df3361f059db42786c830ea96e2d18b8d4db ] sock_alloc_send_pskb() & sk_page_frag_refill() have a loop trying high order allocations to prepare skb with low number of fragments as this increases performance. Problem is that under memory pressure/fragmentation, this can trigger OOM while the intent was only to try the high order allocations, then fallback to order-0 allocations. We had various reports from unexpected regressions. According to David, setting __GFP_NORETRY should be fine, as the asynchronous compaction is still enabled, and this will prevent OOM from kicking as in : CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0, oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled CFSClientEventm Call Trace: [<ffffffff8043766c>] dump_header+0xe1/0x23e [<ffffffff80437a02>] oom_kill_process+0x6a/0x323 [<ffffffff80438443>] out_of_memory+0x4b3/0x50d [<ffffffff8043a4a6>] __alloc_pages_may_oom+0xa2/0xc7 [<ffffffff80236f42>] __alloc_pages_nodemask+0x1002/0x17f0 [<ffffffff8024bd23>] alloc_pages_current+0x103/0x2b0 [<ffffffff8028567f>] sk_page_frag_refill+0x8f/0x160 [<ffffffff80295fa0>] tcp_sendmsg+0x560/0xee0 [<ffffffff802a5037>] inet_sendmsg+0x67/0x100 [<ffffffff80283c9c>] __sock_sendmsg_nosec+0x6c/0x90 [<ffffffff80283e85>] sock_sendmsg+0xc5/0xf0 [<ffffffff802847b6>] __sys_sendmsg+0x136/0x430 [<ffffffff80284ec8>] sys_sendmsg+0x88/0x110 [<ffffffff80711472>] system_call_fastpath+0x16/0x1b Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26net: core: introduce netif_skb_dev_featuresFlorian Westphal
commit d206940319c41df4299db75ed56142177bb2e5f6 upstream. Will be used by upcoming ipv4 forward path change that needs to determine feature mask using skb->dst->dev instead of skb->dev. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26net: add and use skb_gso_transport_seglen()Florian Westphal
commit de960aa9ab4decc3304959f69533eef64d05d8e8 upstream. This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to skbuff core so it may be reused by upcoming ip forwarding path patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26netpoll: fix netconsole IPv6 setupSabrina Dubroca
[ Upstream commit 00fe11b3c67dc670fe6391d22f1fe64e7c99a8ec ] Currently, to make netconsole start over IPv6, the source address needs to be specified. Without a source address, netpoll_parse_options assumes we're setting up over IPv4 and the destination IPv6 address is rejected. Check if the IP version has been forced by a source address before checking for a version mismatch when parsing the destination address. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26net: fix 'ip rule' iif/oif device renameMaciej Żenczykowski
[ Upstream commit 946c032e5a53992ea45e062ecb08670ba39b99e3 ] ip rules with iif/oif references do not update: (detach/attach) across interface renames. Signed-off-by: Maciej Żenczykowski <maze@google.com> CC: Willem de Bruijn <willemb@google.com> CC: Eric Dumazet <edumazet@google.com> CC: Chris Davis <chrismd@google.com> CC: Carlo Contavalli <ccontavalli@google.com> Google-Bug-Id: 12936021 Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-13fuse: fix pipe_buf_operationsMiklos Szeredi
commit 28a625cbc2a14f17b83e47ef907b2658576a32aa upstream. Having this struct in module memory could Oops when if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06bpf: do not use reciprocal divideEric Dumazet
[ Upstream commit aee636c4809fa54848ff07a899b326eb1f9987a2 ] At first Jakub Zawadzki noticed that some divisions by reciprocal_divide were not correct. (off by one in some cases) http://www.wireshark.org/~darkjames/reciprocal-buggy.c He could also show this with BPF: http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c The reciprocal divide in linux kernel is not generic enough, lets remove its use in BPF, as it is not worth the pain with current cpus. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Cc: Mircea Gherzan <mgherzan@gmail.com> Cc: Daniel Borkmann <dxchgb@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Matt Evans <matt@ozlabs.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15netpoll: Fix missing TXQ unlock and and OOPS.David S. Miller
[ Upstream commit aca5f58f9ba803ec8c2e6bcf890db17589e8dfcc ] The VLAN tag handling code in netpoll_send_skb_on_dev() has two problems. 1) It exits without unlocking the TXQ. 2) It then tries to queue a NULL skb to npinfo->txq. Reported-by: Ahmed Tamrawi <atamrawi@iastate.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15vlan: Fix header ops passthru when doing TX VLAN offload.David S. Miller
[ Upstream commit 2205369a314e12fcec4781cc73ac9c08fc2b47de ] When the vlan code detects that the real device can do TX VLAN offloads in hardware, it tries to arrange for the real device's header_ops to be invoked directly. But it does so illegally, by simply hooking the real device's header_ops up to the VLAN device. This doesn't work because we will end up invoking a set of header_ops routines which expect a device type which matches the real device, but will see a VLAN device instead. Fix this by providing a pass-thru set of header_ops which will arrange to pass the proper real device instead. To facilitate this add a dev_rebuild_header(). There are implementations which provide a ->cache and ->create but not a ->rebuild (f.e. PLIP). So we need a helper function just like dev_hard_header() to avoid crashes. Use this helper in the one existing place where the header_ops->rebuild was being invoked, the neighbour code. With lots of help from Florian Westphal. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15net: unix: allow set_peek_off to failSasha Levin
[ Upstream commit 12663bfc97c8b3fdb292428105dd92d563164050 ] unix_dgram_recvmsg() will hold the readlock of the socket until recv is complete. In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until unix_dgram_recvmsg() will complete (which can take a while) without allowing us to break out of it, triggering a hung task spew. Instead, allow set_peek_off to fail, this way userspace will not hang. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15net: drop_monitor: fix the value of maxattrChangli Gao
[ Upstream commit d323e92cc3f4edd943610557c9ea1bb4bb5056e8 ] maxattr in genl_family should be used to save the max attribute type, but not the max command type. Drop monitor doesn't support any attributes, so we should leave it as zero. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15net: clear local_df when passing skb between namespacesHannes Frederic Sowa
[ Upstream commit 239c78db9c41a8f524cce60507440d72229d73bc ] We must clear local_df when passing the skb between namespaces as the packet is not local to the new namespace any more and thus may not get fragmented by local rules. Fred Templin noticed that other namespaces do fragment IPv6 packets while forwarding. Instead they should have send back a PTB. The same problem should be present when forwarding DF-IPv4 packets between namespaces. Reported-by: Templin, Fred L <Fred.L.Templin@boeing.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08{pktgen, xfrm} Update IPv4 header total len and checksum after tranformationfan.du
[ Upstream commit 3868204d6b89ea373a273e760609cb08020beb1a ] commit a553e4a6317b2cfc7659542c10fe43184ffe53da ("[PKTGEN]: IPSEC support") tried to support IPsec ESP transport transformation for pktgen, but acctually this doesn't work at all for two reasons(The orignal transformed packet has bad IPv4 checksum value, as well as wrong auth value, reported by wireshark) - After transpormation, IPv4 header total length needs update, because encrypted payload's length is NOT same as that of plain text. - After transformation, IPv4 checksum needs re-caculate because of payload has been changed. With this patch, armmed pktgen with below cofiguration, Wireshark is able to decrypted ESP packet generated by pktgen without any IPv4 checksum error or auth value error. pgset "flag IPSEC" pgset "flows 1" Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08gso: handle new frag_list of frags GRO packetsHerbert Xu
[ Upstream commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3 ] Recently GRO started generating packets with frag_lists of frags. This was not handled by GSO, thus leading to a crash. Thankfully these packets are of a regular form and are easy to handle. This patch handles them in two ways. For completely non-linear frag_list entries, we simply continue to iterate over the frag_list frags once we exhaust the normal frags. For frag_list entries with linear parts, we call pskb_trim on the first part of the frag_list skb, and then process the rest of the frags in the usual way. This patch also kills a chunk of dead frag_list code that has obviously never ever been run since it ends up generating a bogus GSO-segmented packet with a frag_list entry. Future work is planned to split super big packets into TSO ones. Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb") Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be> Reported-by: Jerry Chu <hkchu@google.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08net: core: Always propagate flag changes to interfacesVlad Yasevich
[ Upstream commit d2615bf450694c1302d86b9cc8a8958edfe4c3a4 ] The following commit: b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 net: only invoke dev->change_rx_flags when device is UP tried to fix a problem with VLAN devices and promiscuouse flag setting. The issue was that VLAN device was setting a flag on an interface that was down, thus resulting in bad promiscuity count. This commit blocked flag propagation to any device that is currently down. A later commit: deede2fabe24e00bd7e246eb81cd5767dc6fcfc7 vlan: Don't propagate flag changes on down interfaces fixed VLAN code to only propagate flags when the VLAN interface is up, thus fixing the same issue as above, only localized to VLAN. The problem we have now is that if we have create a complex stack involving multiple software devices like bridges, bonds, and vlans, then it is possible that the flags would not propagate properly to the physical devices. A simple examle of the scenario is the following: eth0----> bond0 ----> bridge0 ---> vlan50 If bond0 or eth0 happen to be down at the time bond0 is added to the bridge, then eth0 will never have promisc mode set which is currently required for operation as part of the bridge. As a result, packets with vlan50 will be dropped by the interface. The only 2 devices that implement the special flag handling are VLAN and DSA and they both have required code to prevent incorrect flag propagation. As a result we can remove the generic solution introduced in b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 and leave it to the individual devices to decide whether they will block flag propagation or not. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08netfilter: push reasm skb through instead of original frag skbsJiri Pirko
[ Upstream commit 6aafeef03b9d9ecf255f3a80ed85ee070260e1ae ] Pushing original fragments through causes several problems. For example for matching, frags may not be matched correctly. Take following example: <example> On HOSTA do: ip6tables -I INPUT -p icmpv6 -j DROP ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT and on HOSTB you do: ping6 HOSTA -s2000 (MTU is 1500) Incoming echo requests will be filtered out on HOSTA. This issue does not occur with smaller packets than MTU (where fragmentation does not happen) </example> As was discussed previously, the only correct solution seems to be to use reassembled skb instead of separete frags. Doing this has positive side effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams dances in ipvs and conntrack can be removed. Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c entirely and use code in net/ipv6/reassembly.c instead. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08net: rework recvmsg handler msg_name and msg_namelen logicHannes Frederic Sowa
[ Upstream commit f3d3342602f8bcbf37d7c46641cb9bca7618eb1c ] This patch now always passes msg->msg_namelen as 0. recvmsg handlers must set msg_namelen to the proper size <= sizeof(struct sockaddr_storage) to return msg_name to the user. This prevents numerous uninitialized memory leaks we had in the recvmsg handlers and makes it harder for new code to accidentally leak uninitialized memory. Optimize for the case recvfrom is called with NULL as address. We don't need to copy the address at all, so set it to NULL before invoking the recvmsg handler. We can do so, because all the recvmsg handlers must cope with the case a plain read() is called on them. read() also sets msg_name to NULL. Also document these changes in include/linux/net.h as suggested by David Miller. Changes since RFC: Set msg->msg_name = NULL if user specified a NULL in msg_name but had a non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't affect sendto as it would bail out earlier while trying to copy-in the address. It also more naturally reflects the logic by the callers of verify_iovec. With this change in place I could remove " if (!uaddr || msg_sys->msg_namelen == 0) msg->msg_name = NULL ". This change does not alter the user visible error logic as we ignore msg_namelen as long as msg_name is NULL. Also remove two unnecessary curly brackets in ___sys_recvmsg and change comments to netdev style. Cc: David Miller <davem@davemloft.net> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08core/dev: do not ignore dmac in dev_forward_skb()Alexei Starovoitov
[ Upstream commit 81b9eab5ebbf0d5d54da4fc168cfb02c2adc76b8 ] commit 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f230a91 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08net: Fix "ip rule delete table 256"Andreas Henriksson
[ Upstream commit 13eb2ab2d33c57ebddc57437a7d341995fc9138c ] When trying to delete a table >= 256 using iproute2 the local table will be deleted. The table id is specified as a netlink attribute when it needs more then 8 bits and iproute2 then sets the table field to RT_TABLE_UNSPEC (0). Preconditions to matching the table id in the rule delete code doesn't seem to take the "table id in netlink attribute" into condition so the frh_get_table helper function never gets to do its job when matching against current rule. Use the helper function twice instead of peaking at the table value directly. Originally reported at: http://bugs.debian.org/724783 Reported-by: Nicolas HICHER <nhicher@avencall.com> Signed-off-by: Andreas Henriksson <andreas@fatal.se> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-20net: flow_dissector: fail on evil iph->ihlJason Wang
[ Upstream commit 6f092343855a71e03b8d209815d8c45bf3a27fcd ] We don't validate iph->ihl which may lead a dead loop if we meet a IPIP skb whose iph->ihl is zero. Fix this by failing immediately when iph->ihl is evil (less than 5). This issue were introduced by commit ec5efe7946280d1e84603389a1030ccec0a767ae (rps: support IPIP encapsulation). Signed-off-by: Jason Wang <jasowang@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-09net: secure_seq: Fix warning when CONFIG_IPV6 and CONFIG_INET are not selectedFabio Estevam
net_secret() is only used when CONFIG_IPV6 or CONFIG_INET are selected. Building a defconfig with both of these symbols unselected (Using the ARM at91sam9rl_defconfig, for example) leads to the following build warning: $ make at91sam9rl_defconfig # # configuration written to .config # $ make net/core/secure_seq.o scripts/kconfig/conf --silentoldconfig Kconfig CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h make[1]: `include/generated/mach-types.h' is up to date. CALL scripts/checksyscalls.sh CC net/core/secure_seq.o net/core/secure_seq.c:17:13: warning: 'net_secret_init' defined but not used [-Wunused-function] Fix this warning by protecting the definition of net_secret() with these symbols. Reported-by: Olof Johansson <olof@lixom.net> Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09pkt_sched: fq: fix non TCP flows pacingEric Dumazet
Steinar reported FQ pacing was not working for UDP flows. It looks like the initial sk->sk_pacing_rate value of 0 was a wrong choice. We should init it to ~0U (unlimited) Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes no real sense. The default rate is really unlimited, and we need to avoid a zero divide. Reported-by: Steinar H. Gunderson <sesse@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07net: fix unsafe set_memory_rw from softirqAlexei Starovoitov
on x86 system with net.core.bpf_jit_enable = 1 sudo tcpdump -i eth1 'tcp port 22' causes the warning: [ 56.766097] Possible unsafe locking scenario: [ 56.766097] [ 56.780146] CPU0 [ 56.786807] ---- [ 56.793188] lock(&(&vb->lock)->rlock); [ 56.799593] <Interrupt> [ 56.805889] lock(&(&vb->lock)->rlock); [ 56.812266] [ 56.812266] *** DEADLOCK *** [ 56.812266] [ 56.830670] 1 lock held by ksoftirqd/1/13: [ 56.836838] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380 [ 56.849757] [ 56.849757] stack backtrace: [ 56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45 [ 56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012 [ 56.882004] ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007 [ 56.895630] ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001 [ 56.909313] ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001 [ 56.923006] Call Trace: [ 56.929532] [<ffffffff8175a145>] dump_stack+0x55/0x76 [ 56.936067] [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208 [ 56.942445] [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50 [ 56.948932] [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150 [ 56.955470] [<ffffffff810ccb52>] mark_lock+0x282/0x2c0 [ 56.961945] [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50 [ 56.968474] [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50 [ 56.975140] [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90 [ 56.981942] [<ffffffff810cef72>] lock_acquire+0x92/0x1d0 [ 56.988745] [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380 [ 56.995619] [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50 [ 57.002493] [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380 [ 57.009447] [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380 [ 57.016477] [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380 [ 57.023607] [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460 [ 57.030818] [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10 [ 57.037896] [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0 [ 57.044789] [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0 [ 57.051720] [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40 [ 57.058727] [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40 [ 57.065577] [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30 [ 57.072338] [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0 [ 57.078962] [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0 [ 57.085373] [<ffffffff81058245>] run_ksoftirqd+0x35/0x70 cannot reuse jited filter memory, since it's readonly, so use original bpf insns memory to hold work_struct defer kfree of sk_filter until jit completed freeing tested on x86_64 and i386 Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07netif_set_xps_queue: make cpu mask constMichael S. Tsirkin
virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30net: flow_dissector: fix thoff for IPPROTO_AHEric Dumazet
In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for later usage"), we missed that existing code was using nhoff as a temporary variable that could not always contain transport header offset. This is not a problem for TCP/UDP because port offset (@poff) is 0 for these protocols. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28net: net_secret should not depend on TCPEric Dumazet
A host might need net_secret[] and never open a single socket. Problem added in commit aebda156a570782 ("net: defer net_secret[] initialization") Based on prior patch from Hannes Frederic Sowa. Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@strressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28net: Delay default_device_exit_batch until no devices are unregistering v2Eric W. Biederman
There is currently serialization network namespaces exiting and network devices exiting as the final part of netdev_run_todo does not happen under the rtnl_lock. This is compounded by the fact that the only list of devices unregistering in netdev_run_todo is local to the netdev_run_todo. This lack of serialization in extreme cases results in network devices unregistering in netdev_run_todo after the loopback device of their network namespace has been freed (making dst_ifdown unsafe), and after the their network namespace has exited (making the NETDEV_UNREGISTER, and NETDEV_UNREGISTER_FINAL callbacks unsafe). Add the missing serialization by a per network namespace count of how many network devices are unregistering and having a wait queue that is woken up whenever the count is decreased. The count and wait queue allow default_device_exit_batch to wait until all of the unregistration activity for a network namespace has finished before proceeding to unregister the loopback device and then allowing the network namespace to exit. Only a single global wait queue is used because there is a single global lock, and there is a single waiter, per network namespace wait queues would be a waste of resources. The per network namespace count of unregistering devices gives a progress guarantee because the number of network devices unregistering in an exiting network namespace must ultimately drop to zero (assuming network device unregistration completes). The basic logic remains the same as in v1. This patch is now half comment and half rtnl_lock_unregistering an expanded version of wait_event performs no extra work in the common case where no network devices are unregistering when we get to default_device_exit_batch. Reported-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-19netpoll: fix NULL pointer dereference in netpoll_cleanupNikolay Aleksandrov
I've been hitting a NULL ptr deref while using netconsole because the np->dev check and the pointer manipulation in netpoll_cleanup are done without rtnl and the following sequence happens when having a netconsole over a vlan and we remove the vlan while disabling the netconsole: CPU 1 CPU2 removes vlan and calls the notifier enters store_enabled(), calls netdev_cleanup which checks np->dev and then waits for rtnl executes the netconsole netdev release notifier making np->dev == NULL and releases rtnl continues to dereference a member of np->dev which at this point is == NULL Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-12netpoll: Should handle ETH_P_ARP other than ETH_P_IP in netpoll_neigh_replySonic Zhang
The received ARP request type in the Ethernet packet head is ETH_P_ARP other than ETH_P_IP. [ Bug introduced by commit b7394d2429c198b1da3d46ac39192e891029ec0f ("netpoll: prepare for ipv6") ] Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-11net: fix multiqueue selectionEric Dumazet
commit 416186fbf8c5b4e4465 ("net: Split core bits of netdev_pick_tx into __netdev_pick_tx") added a bug that disables caching of queue index in the socket. This is the source of packet reorders for TCP flows, and again this is happening more often when using FQ pacing. Old code was doing if (queue_index != old_index) sk_tx_queue_set(sk, queue_index); Alexander renamed the variables but forgot to change sk_tx_queue_set() 2nd parameter. if (queue_index != new_index) sk_tx_queue_set(sk, queue_index); This means we store -1 over and over in sk->sk_tx_queue_mapping Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users
2013-09-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking changes from David Miller: "Noteworthy changes this time around: 1) Multicast rejoin support for team driver, from Jiri Pirko. 2) Centralize and simplify TCP RTT measurement handling in order to reduce the impact of bad RTO seeding from SYN/ACKs. Also, when both timestamps and local RTT measurements are available prefer the later because there are broken middleware devices which scramble the timestamp. From Yuchung Cheng. 3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel memory consumed to queue up unsend user data. From Eric Dumazet. 4) Add a "physical port ID" abstraction for network devices, from Jiri Pirko. 5) Add a "suppress" operation to influence fib_rules lookups, from Stefan Tomanek. 6) Add a networking development FAQ, from Paul Gortmaker. 7) Extend the information provided by tcp_probe and add ipv6 support, from Daniel Borkmann. 8) Use RCU locking more extensively in openvswitch data paths, from Pravin B Shelar. 9) Add SCTP support to openvswitch, from Joe Stringer. 10) Add EF10 chip support to SFC driver, from Ben Hutchings. 11) Add new SYNPROXY netfilter target, from Patrick McHardy. 12) Compute a rate approximation for sending in TCP sockets, and use this to more intelligently coalesce TSO frames. Furthermore, add a new packet scheduler which takes advantage of this estimate when available. From Eric Dumazet. 13) Allow AF_PACKET fanouts with random selection, from Daniel Borkmann. 14) Add ipv6 support to vxlan driver, from Cong Wang" Resolved conflicts as per discussion. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits) openvswitch: Fix alignment of struct sw_flow_key. netfilter: Fix build errors with xt_socket.c tcp: Add missing braces to do_tcp_setsockopt caif: Add missing braces to multiline if in cfctrl_linkup_request bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize vxlan: Fix kernel panic on device delete. net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls net: mvneta: properly disable HW PHY polling and ensure adjust_link() works icplus: Use netif_running to determine device state ethernet/arc/arc_emac: Fix huge delays in large file copies tuntap: orphan frags before trying to set tx timestamp tuntap: purge socket error queue on detach qlcnic: use standard NAPI weights ipv6:introduce function to find route for redirect bnx2x: VF RSS support - VF side bnx2x: VF RSS support - PF side vxlan: Notify drivers for listening UDP port changes net: usbnet: update addr_assign_type if appropriate driver/net: enic: update enic maintainers and driver driver/net: enic: Exposing symbols for Cisco's low latency driver ...
2013-09-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c net/bridge/br_multicast.c net/ipv6/sit.c The conflicts were minor: 1) sit.c changes overlap with change to ip_tunnel_xmit() signature. 2) br_multicast.c had an overlap between computing max_delay using msecs_to_jiffies and turning MLDV2_MRC() into an inline function with a name using lowercase instead of uppercase letters. 3) stmmac had two overlapping changes, one which conditionally allocated and hooked up a dma_cfg based upon the presence of the pbl OF property, and another one handling store-and-forward DMA made. The latter of which should not go into the new of_find_property() basic block. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04net: correctly interlink lower/upper devicesVeaceslav Falico
Currently we're linking upper devices to lower ones, which results in upside-down relationship: upper devices seeing lower devices via its upper lists. Fix this by correctly linking lower devices to the upper ones. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04skb: allow skb_scrub_packet() to be used by tunnelsNicolas Dichtel
This function was only used when a packet was sent to another netns. Now, it can also be used after tunnel encapsulation or decapsulation. Only skb_orphan() should not be done when a packet is not crossing netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04net: neighbour: Remove CONFIG_ARPDTim Gardner
This config option is superfluous in that it only guards a call to neigh_app_ns(). Enabling CONFIG_ARPD by default has no change in behavior. There will now be call to __neigh_notify() for each ARP resolution, which has no impact unless there is a user space daemon waiting to receive the notification, i.e., the case for which CONFIG_ARPD was designed anyways. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: Joe Perches <joe@perches.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04Merge branch 'for-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
2013-09-03Merge tag 'driver-core-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here's the big driver core pull request for 3.12-rc1. Lots of tiny changes here fixing up the way sysfs attributes are created, to try to make drivers simpler, and fix a whole class race conditions with creations of device attributes after the device was announced to userspace. All the various pieces are acked by the different subsystem maintainers" * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits) firmware loader: fix pending_fw_head list corruption drivers/base/memory.c: introduce help macro to_memory_block dynamic debug: line queries failing due to uninitialized local variable sysfs: sysfs_create_groups returns a value. debugfs: provide debugfs_create_x64() when disabled rbd: convert bus code to use bus_groups firmware: dcdbas: use binary attribute groups sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled driver core: add #include <linux/sysfs.h> to core files. HID: convert bus code to use dev_groups Input: serio: convert bus code to use drv_groups Input: gameport: convert bus code to use drv_groups driver core: firmware: use __ATTR_RW() driver core: core: use DEVICE_ATTR_RO driver core: bus: use DRIVER_ATTR_WO() driver core: create write-only attribute macros for devices and drivers sysfs: create __ATTR_WO() driver-core: platform: convert bus code to use dev_groups workqueue: convert bus code to use dev_groups MEI: convert bus code to use dev_groups ...
2013-08-31userns: Kill nsown_capable it makes the wrong thing easyEric W. Biederman
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and CAP_SETGID. For the existing users it doesn't noticably simplify things and from the suggested patches I have seen it encourages people to do the wrong thing. So remove nsown_capable. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-31qdisc: allow setting default queuing disciplinestephen hemminger
By default, the pfifo_fast queue discipline has been used by default for all devices. But we have better choices now. This patch allow setting the default queueing discipline with sysctl. This allows easy use of better queueing disciplines on all devices without having to use tc qdisc scripts. It is intended to allow an easy path for distributions to make fq_codel or sfq the default qdisc. This patch also makes pfifo_fast more of a first class qdisc, since it is now possible to manually override the default and explicitly use pfifo_fast. The behavior for systems who do not use the sysctl is unchanged, they still get pfifo_fast Also removes leftover random # in sysctl net core. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30net: revert 8728c544a9c ("net: dev_pick_tx() fix")Eric Dumazet
commit 8728c544a9cbdc ("net: dev_pick_tx() fix") and commit b6fe83e9525a ("bonding: refine IFF_XMIT_DST_RELEASE capability") are quite incompatible : Queue selection is disabled because skb dst was dropped before entering bonding device. This causes major performance regression, mainly because TCP packets for a given flow can be sent to multiple queues. This is particularly visible when using the new FQ packet scheduler with MQ + FQ setup on the slaves. We can safely revert the first commit now that 416186fbf8c5b ("net: Split core bits of netdev_pick_tx into __netdev_pick_tx") properly caps the queue_index. Reported-by: Xi Wang <xii@google.com> Diagnosed-by: Xi Wang <xii@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Denys Fedorysychenko <nuclearcat@nuclearcat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add netdev_upper_get_next_dev_rcu(dev, iter)Veaceslav Falico
This function returns the next dev in the dev->upper_dev_list after the struct list_head **iter position, and updates *iter accordingly. Returns NULL if there are no devices left. Caller must hold RCU read lock. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: remove search_list from netdev_adjacentVeaceslav Falico
We already don't need it cause we see every upper/lower device in the list already. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add lower_dev_list to net_device and make a full meshVeaceslav Falico
This patch adds lower_dev_list list_head to net_device, which is the same as upper_dev_list, only for lower devices, and begins to use it in the same way as the upper list. It also changes the way the whole adjacent device lists work - now they contain *all* of upper/lower devices, not only the first level. The first level devices are distinguished by the bool neighbour field in netdev_adjacent, also added by this patch. There are cases when a device can be added several times to the adjacent list, the simplest would be: /---- eth0.10 ---\ eth0- --- bond0 \---- eth0.20 ---/ where both bond0 and eth0 'see' each other in the adjacent lists two times. To avoid duplication of netdev_adjacent structures ref_nr is being kept as the number of times the device was added to the list. The 'full view' is achieved by adding, on link creation, all of the upper_dev's upper_dev_list devices as upper devices to all of the lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice versa. On unlink they are removed using the same logic. I've tested it with thousands vlans/bonds/bridges, everything works ok and no observable lags even on a huge number of interfaces. Memory footprint for 128 devices interconnected with each other via both upper and lower (which is impossible, but for the comparison) lists would be: 128*128*2*sizeof(netdev_adjacent) = 1.5MB but in the real world we usualy have at most several devices with slaves and a lot of vlans, so the footprint will be much lower. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: rename netdev_upper to netdev_adjacentVeaceslav Falico
Rename the structure to reflect the upcoming addition of lower_dev_list. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29sysfs: Restrict mounting sysfsEric W. Biederman
Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights over the net namespace. The principle here is if you create or have capabilities over it you can mount it, otherwise you get to live with what other people have mounted. Instead of testing this with a straight forward ns_capable call, perform this check the long and torturous way with kobject helpers, this keeps direct knowledge of namespaces out of sysfs, and preserves the existing sysfs abstractions. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>