summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-05-10batman-adv: Remove hdr_size skb size check in batadv_interface_rxSven Eckelmann
The callers of batadv_interface_rx have to make sure that enough data can be pulled from the skb when they read the batman-adv header. The only two functions using it are either calling pskb_may_pull with hdr_size directly (batadv_recv_bcast_packet) or indirectly via batadv_check_unicast_packet (batadv_recv_unicast_packet). Reported-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-10batman-adv: Remove unused parameter recv_if of batadv_interface_rxSven Eckelmann
Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-05-10netfilter: conntrack: remove uninitialized shadow variableArnd Bergmann
A recent commit introduced an unconditional use of an uninitialized variable, as reported in this gcc warning: net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm': net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used uninitialized in this function [-Werror=maybe-uninitialized] bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes); ^ net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here enum ip_conntrack_info ctinfo; The problem is that a local variable shadows the function parameter. This removes the local variable, which looks like what Pablo originally intended. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10ip6_gre: Use correct flags for reading TUNNEL_SEQTom Herbert
Fix two spots where o_flags in a tunnel are being compared to GRE_SEQ instead of TUNNEL_SEQ. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10ip6: Don't set transport header in IPv6 tunnelingTom Herbert
We only need to reset network header here. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10ip6_gre: Set inner protocol correctly in __gre6_xmitTom Herbert
Need to use adjusted protocol value for setting inner protocol. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10gre6: Fix flag translationsTom Herbert
GRE for IPv6 does not properly translate for GRE flags to tunnel flags and vice versa. This patch fixes that. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10ip6_gre: Fix MTU settingTom Herbert
In ip6gre_tnl_link_config set t->tun_len and t->hlen correctly for the configuration. For hard_header_len and mtu calculation include IPv6 header and encapsulation overhead. In ip6gre_tunnel_init_common set t->tun_len and t->hlen correctly for the configuration. Revert to setting hard_header_len instead of needed_headroom. Tested: ./ip link add name tun8 type ip6gretap remote \ 2401:db00:20:911a:face:0:27:0 local \ 2401:db00:20:911a:face:0:25:0 ttl 225 Gives MTU of 1434. That is equal to 1500 - 40 - 14 - 4 - 8. ./ip link add name tun8 type ip6gretap remote \ 2401:db00:20:911a:face:0:27:0 local \ 2401:db00:20:911a:face:0:25:0 ttl 225 okey 123 Gives MTU of 1430. That is equal to 1500 - 40 - 14 - 4 - 8 - 4. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10net: l3mdev: Allow send on enslaved interfaceDavid Ahern
Allow udp and raw sockets to send by oif that is an enslaved interface versus the l3mdev/VRF device. For example, this allows BFD to use ifindex from IP_PKTINFO on a receive to send a response without the need to convert to the VRF index. It also allows ping and ping6 to work when specifying an enslaved interface (e.g., ping -I swp1 <ip>) which is a natural use case. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10net: l3mdev: Move get_saddr and rt6_dstDavid Ahern
Move l3mdev_rt6_dst_by_oif and l3mdev_get_saddr to l3mdev.c. Collapse l3mdev_get_rt6_dst into l3mdev_rt6_dst_by_oif since it is the only user and keep the l3mdev_get_rt6_dst name for consistency with other hooks. A follow-on patch adds more code to these functions making them long for inlined functions. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
In netdevice.h we removed the structure in net-next that is being changes in 'net'. In macsec.c and rtnetlink.c we have overlaps between fixes in 'net' and the u64 attribute changes in 'net-next'. The mlx5 conflicts have to do with vxlan support dependencies. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following large patchset contains Netfilter updates for your net-next tree. My initial intention was to send you this in two goes but when I looked back twice I already had this burden on top of me. Several updates for IPVS from Marco Angaroni: 1) Allow SIP connections originating from real-servers to be load balanced by the SIP persistence engine as is already implemented in the other direction. 2) Release connections immediately for One-packet-scheduling (OPS) in IPVS, instead of making it via timer and rcu callback. 3) Skip deleting conntracks for each one packet in OPS, and don't call nf_conntrack_alter_reply() since no reply is expected. 4) Enable drop on exhaustion for OPS + SIP persistence. Miscelaneous conntrack updates from Florian Westphal, including fix for hash resize: 5) Move conntrack generation counter out of conntrack pernet structure since this is only used by the init_ns to allow hash resizing. 6) Use get_random_once() from packet path to collect hash random seed instead of our compound. 7) Don't disable BH from ____nf_conntrack_find() for statistics, use NF_CT_STAT_INC_ATOMIC() instead. 8) Fix lookup race during conntrack hash resizing. 9) Introduce clash resolution on conntrack insertion for connectionless protocol. Then, Florian's netns rework to get rid of per-netns conntrack table, thus we use one single table for them all. There was consensus on this change during the NFWS 2015 and, on top of that, it has recently been pointed as a source of multiple problems from unpriviledged netns: 11) Use a single conntrack hashtable for all namespaces. Include netns in object comparisons and make it part of the hash calculation. Adapt early_drop() to consider netns. 12) Use single expectation and NAT hashtable for all namespaces. 13) Use a single slab cache for all namespaces for conntrack objects. 14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet conntrack counter tells us the table is empty (ie. equals zero). Fixes for nf_tables interval set element handling, support to set conntrack connlabels and allow set names up to 32 bytes. 15) Parse element flags from element deletion path and pass it up to the backend set implementation. 16) Allow adjacent intervals in the rbtree set type for dynamic interval updates. 17) Add support to set connlabel from nf_tables, from Florian Westphal. 18) Allow set names up to 32 bytes in nf_tables. Several x_tables fixes and updates: 19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch from Andrzej Hajda. And finally, miscelaneous netfilter updates such as: 20) Disable automatic helper assignment by default. Note this proc knob was introduced by a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment") 4 years ago to start moving towards explicit conntrack helper configuration via iptables CT target. 21) Get rid of obsolete and inconsistent debugging instrumentation in x_tables. 22) Remove unnecessary check for null after ip6_route_output(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09netfilter: conntrack: use single slab cacheFlorian Westphal
An earlier patch changed lookup side to also net_eq() namespaces after obtaining a reference on the conntrack, so a single kmemcache can be used. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-09netfilter: conntrack: use a single nat bysource table for all namespacesFlorian Westphal
We already include netns address in the hash, so we only need to use net_eq in find_appropriate_src and can then put all entries into same table. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-09netfilter: conntrack: make netns address part of nat bysrc hashFlorian Westphal
Will be needed soon when we place all in the same hash table. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-09net: make sch_handle_ingress() drop monitor readyEric Dumazet
TC_ACT_STOLEN is used when ingress traffic is mirred/redirected to say ifb. Packet is not dropped, but consumed. Only TC_ACT_SHOT is a clear indication something went wrong. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09fq_codel: add memory limitation per queueEric Dumazet
On small embedded routers, one wants to control maximal amount of memory used by fq_codel, instead of controlling number of packets or bytes, since GRO/TSO make these not practical. Assuming skb->truesize is accurate, we have to keep track of skb->truesize sum for skbs in queue. This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute. I chose a default value of 32 MBytes, which looks reasonable even for heavy duty usages. (Prior fq_codel users should not be hurt when they upgrade their kernels) Two fields are added to tc_fq_codel_qd_stats to report : - Current memory usage - Number of drops caused by memory limits # tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M .. # tc -s -d qd sh dev eth1 qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0 requeues 21705223) rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223 maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414 ecn_mark 0 memory_used 4190048 drop_overmemory 4994406 new_flows_len 1 old_flows_len 177 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Dave Täht <dave.taht@gmail.com> Cc: Sebastian Möller <moeller0@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09net: Add Qualcomm IPC routerCourtney Cavin
Add an implementation of Qualcomm's IPC router protocol, used to communicate with service providing remote processors. Signed-off-by: Courtney Cavin <courtney.cavin@sonymobile.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR] Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_offload: Set encapsulation before inner completes.Jarno Rajahalme
UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_tunnel: Remove redundant udp_tunnel_gro_complete().Jarno Rajahalme
The setting of the UDP tunnel GSO type is already performed by udp[46]_gro_complete(). Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06ipv4: tcp: ip_send_unicast_reply() is not BH safeEric Dumazet
I forgot that ip_send_unicast_reply() is not BH safe (yet). Disabling preemption before calling it was not a good move. Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06bpf: wire in data and data_end for cls_act_bpfAlexei Starovoitov
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers. The bpf helpers that change skb->data need to update data_end pointer as well. The verifier checks that programs always reload data, data_end pointers after calls to such bpf helpers. We cannot add 'data_end' pointer to struct qdisc_skb_cb directly, since it's embedded as-is by infiniband ipoib, so wrapper struct is needed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06net: vrf: Create FIB tables on link createDavid Ahern
Tables have to exist for VRFs to function. Ensure they exist when VRF device is created. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06net: ipv6: tcp reset, icmp need to consider L3 domainDavid Ahern
Responses for packets to unused ports are getting lost with L3 domains. IPv4 has ip_send_unicast_reply for sending TCP responses which accounts for L3 domains; update the IPv6 counterpart tcp_v6_send_response. For icmp the L3 master check needs to be moved up in icmp6_send to properly respond to UDP packets to a port with no listener. Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06bridge: fix igmp / mld query parsingLinus Lüssing
With the newly introduced helper functions the skb pulling is hidden in the checksumming function - and undone before returning to the caller. The IGMP and MLD query parsing functions in the bridge still assumed that the skb is pointing to the beginning of the IGMP/MLD message while it is now kept at the beginning of the IPv4/6 header. If there is a querier somewhere else, then this either causes the multicast snooping to stay disabled even though it could be enabled. Or, if we have the querier enabled too, then this can create unnecessary IGMP / MLD query messages on the link. Fixing this by taking the offset between IP and IGMP/MLD header into account, too. Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code") Reported-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06Merge tag 'ipvs2-for-v4.7' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Second Round of IPVS Updates for v4.7 please consider these enhancements to the IPVS. They allow its DoS mitigation strategy effective in conjunction with the SIP persistence engine. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06netfilter: conntrack: use a single expectation table for all namespacesFlorian Westphal
We already include netns address in the hash and compare the netns pointers during lookup, so even if namespaces have overlapping addresses entries will be spread across the expectation table. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06netfilter: conntrack: make netns address part of expect hashFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06netfilter: conntrack: check netns when walking expect hashFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-06ipvs: make drop_entry protection effective for SIP-peMarco Angaroni
DoS protection policy that deletes connections to avoid out of memory is currently not effective for SIP-pe plus OPS-mode for two reasons: 1) connection templates (holding SIP call-id) are always skipped in ip_vs_random_dropentry() 2) in_pkts counter (used by drop_entry algorithm) is not incremented for connection templates This patch addresses such problems with the following changes: a) connection templates associated (via their dest) to virtual-services configured in OPS mode are included in ip_vs_random_dropentry() monitoring. This applies to SIP-pe over UDP (which requires OPS mode), but is more general principle: when OPS is controlled by templates memory can be used only by templates themselves, since OPS conns are deleted after packet is forwarded. b) OPS connections, if controlled by a template, cause increment of in_pkts counter of their template. This is already happening but only in case director is in master-slave mode (see ip_vs_sync_conn()). Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-05-06net: bridge: fix old ioctl unlocked net device walkNikolay Aleksandrov
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl calls which aren't called with rtnl held. The comment above says that it is called with rtnl but that is not really the case. Here's a sample output from a test ASSERT_RTNL() which I put in get_bridge_ifindices and executed "brctl show": [ 957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30) [ 957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G W O 4.6.0-rc4+ #157 [ 957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [ 957.423009] 0000000000000000 ffff880058adfdf0 ffffffff8138dec5 0000000000000400 [ 957.423009] ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32 0000000000000001 [ 957.423009] 00007ffec1a444b0 0000000000000400 ffff880053c19130 0000000000008940 [ 957.423009] Call Trace: [ 957.423009] [<ffffffff8138dec5>] dump_stack+0x85/0xc0 [ 957.423009] [<ffffffffa05ead32>] br_ioctl_deviceless_stub+0x212/0x2e0 [bridge] [ 957.423009] [<ffffffff81515beb>] sock_ioctl+0x22b/0x290 [ 957.423009] [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700 [ 957.423009] [<ffffffff8126c159>] SyS_ioctl+0x79/0x90 [ 957.423009] [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1 Since it only reads bridge ifindices, we can use rcu to safely walk the net device list. Also remove the wrong rtnl comment above. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06VSOCK: do not disconnect socket when peer has shutdown SEND onlyIan Campbell
The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell <ian.campbell@docker.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Andy King <acking@vmware.com> Cc: Dmitry Torokhov <dtor@vmware.com> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Adit Ranadive <aditr@vmware.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-05netfilter: nf_tables: allow set names up to 32 bytesPablo Neira Ayuso
Currently, we support set names of up to 16 bytes, get this aligned with the maximum length we can use in ipset to make it easier when considering migration to nf_tables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: x_tables: get rid of old and inconsistent debuggingPablo Neira Ayuso
The dprintf() and duprintf() functions are enabled at compile time, these days we have better runtime debugging through pr_debug() and static keys. On top of this, this debugging is so old that I don't expect anyone using this anymore, so let's get rid of this. IP_NF_ASSERT() is still left in place, although this needs that NETFILTER_DEBUG is enabled, I think these assertions provide useful context information when reading the code. Note that ARP_NF_ASSERT() has been removed as there is no user of this. Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug() spots that are inconsistently placed in the code. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05openvswitch: __nf_ct_l{3,4}proto_find() always return a valid pointerPablo Neira Ayuso
If the protocol is not natively supported, this assigns generic protocol tracker so we can always assume a valid pointer after these calls. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Joe Stringer <joe@ovn.org>
2016-05-05netfilter: conntrack: introduce clash resolution on insertion racePablo Neira Ayuso
This patch introduces nf_ct_resolve_clash() to resolve race condition on conntrack insertions. This is particularly a problem for connection-less protocols such as UDP, with no initial handshake. Two or more packets may race to insert the entry resulting in packet drops. Another problematic scenario are packets enqueued to userspace via NFQUEUE after the raw table, that make it easier to trigger this race. To resolve this, the idea is to reset the conntrack entry to the one that won race. Packet and bytes counters are also merged. The 'insert_failed' stats still accounts for this situation, after this patch, the drop counter is bumped whenever we drop packets, so we can watch for unresolved clashes. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: introduce nf_ct_acct_update()Pablo Neira Ayuso
Introduce a helper function to update conntrack counters. __nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset() that is expected to be zero from the ipv4/ipv6 hooks. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: __nf_ct_l4proto_find() always returns valid pointerPablo Neira Ayuso
Remove unnecessary check for non-nul pointer in destroy_conntrack() given that __nf_ct_l4proto_find() returns the generic protocol tracker if the protocol is not supported. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: consider ct netns in early_drop logicFlorian Westphal
When iterating, skip conntrack entries living in a different netns. We could ignore netns and kill some other non-assured one, but it has two problems: - a netns can kill non-assured conntracks in other namespace - we would start to 'over-subscribe' the affected/overlimit netns. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: use a single hashtable for all namespacesFlorian Westphal
We already include netns address in the hash and compare the netns pointers during lookup, so even if namespaces have overlapping addresses entries will be spread across the table. Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a 64bit system. NAT bysrc and expectation hash is still per namespace, those will changed too soon. Future patch will also make conntrack object slab cache global again. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: make netns address part of hashFlorian Westphal
Once we place all conntracks into a global hash table we want them to be spread across entire hash table, even if namespaces have overlapping ip addresses. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: check netns when comparing conntrack objectsFlorian Westphal
Once we place all conntracks in the same hash table we must also compare the netns pointer to skip conntracks that belong to a different namespace. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: small refactoring of conntrack seq_printfFlorian Westphal
The iteration process is lockless, so we test if the conntrack object is eligible for printing (e.g. is AF_INET) after obtaining the reference count. Once we put all conntracks into same hash table we might see more entries that need to be skipped. So add a helper and first perform the test in a lockless fashion for fast skip. Once we obtain the reference count, just repeat the check. Note that this refactoring also includes a missing check for unconfirmed conntrack entries due to slab rcu object re-usage, so they need to be skipped since they are not part of the listing. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: use nf_ct_key_equal() in more placesFlorian Westphal
This prepares for upcoming change that places all conntracks into a single, global table. For this to work we will need to also compare net pointer during lookup. To avoid open-coding such check use the nf_ct_key_equal helper and then later extend it to also consider net_eq. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: don't attempt to iterate over empty tableFlorian Westphal
Once we place all conntracks into same table iteration becomes more costly because the table contains conntracks that we are not interested in (belonging to other netns). So don't bother scanning if the current namespace has no entries. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: fix lookup race during hash resizeFlorian Westphal
When resizing the conntrack hash table at runtime via echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with the conntrack lookup path -- reads can happen in parallel and nothing prevents readers from observing a the newly allocated hash but the old size (or vice versa). So access to hash[bucket] can trigger OOB read access in case the table got expanded and we saw the new size but the old hash pointer (or it got shrunk and we got new hash ptr but the size of the old and larger table): kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107 [..] Call Trace: [<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90 [<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0 [<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760 Use generation counter to obtain the address/length of the same table. Also add a synchronize_net before freeing the old hash. AFAICS, without it we might access ct_hash[bucket] after ct_hash has been freed, provided that lockless reader got delayed by another event: CPU1 CPU2 seq_begin seq_retry <delay> resize occurs free oldhash for_each(oldhash[size]) Note that resize is only supported in init_netns, it took over 2 minutes of constant resizing+flooding to produce the warning, so this isn't a big problem in practice. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: conntrack: keep BH enabled during lookupFlorian Westphal
No need to disable BH here anymore: stats are switched to _ATOMIC variant (== this_cpu_inc()), which nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05netfilter: nftables: add connlabel set supportFlorian Westphal
Conntrack labels are currently sized depending on the iptables ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we would allocate enough room to store at least bit 65. However, with nft, the input is just a register with arbitrary runtime content. We therefore ask for the upper ceiling we currently have, which is enough room to store 128 bits. Alternatively, we could alter nf_connlabel_replace to increase net->ct.label_words at run time, but since 128 bits is not that big we'd only save sizeof(long) so it doesn't seem worth it for now. This follows a similar approach that xtables 'connlabel' match uses, so when user inputs ct label set bar then we will set the bit used by the 'bar' label and leave the rest alone. This is done by passing the sreg content to nf_connlabels_replace as both value and mask argument. Labels (bits) already set thus cannot be re-set to zero, but this is not supported by xtables connlabel match either. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05tcp: two more missing bh disableEric Dumazet
percpu_counter only have protection against preemption. TCP stack uses them possibly from BH, so we need BH protection in contexts that could be run in process context Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04tcp: must block bh in __inet_twsk_hashdance()Eric Dumazet
__inet_twsk_hashdance() might be called from process context, better block BH before acquiring bind hash and established locks Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>