summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-05-25libceph: switch to calc_target(), part 2Ilya Dryomov
The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all call to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: switch to calc_target(), part 1Ilya Dryomov
Replace __calc_request_pg() and most of __map_request() with calc_target() and start using req->r_t. ceph_osdc_build_request() however still encodes base_oid, because it's called before calc_target() is and target_oid is empty at that point in time; a printf in osdc_show() also shows base_oid. This is fixed in "libceph: switch to calc_target(), part 2". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: introduce ceph_osd_request_target, calc_target()Ilya Dryomov
Introduce ceph_osd_request_target, containing all mapping-related fields of ceph_osd_request and calc_target() for calculating mappings and populating it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: pi->min_size, pi->last_force_request_resendIlya Dryomov
Add and decode pi->min_size and pi->last_force_request_resend. These are going to be used by calc_target(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: make pgid_cmp() globalIlya Dryomov
calc_target() code is going to need to know how to compare PGs. Take lhs and rhs pgid by const * while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: rename ceph_calc_pg_primary()Ilya Dryomov
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to emphasise that it returns acting primary. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: ceph_osds, ceph_pg_to_up_acting_osds()Ilya Dryomov
Knowning just acting set isn't enough, we need to be able to record up set as well to detect interval changes. This means returning (up[], up_len, up_primary, acting[], acting_len, acting_primary) and passing it around. Introduce and switch to ceph_osds to help with that. Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return both up and acting sets from it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: rename ceph_oloc_oid_to_pg()Ilya Dryomov
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise that returned is raw PG and return -ENOENT instead of -EIO if the pool doesn't exist. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: DEFINE_RB_FUNCS macroIlya Dryomov
Given struct foo { u64 id; struct rb_node bar_node; }; generate insert_bar(), erase_bar() and lookup_bar() functions with DEFINE_RB_FUNCS(bar, struct foo, id, bar_node) The key is assumed to be an integer (u64, int, etc), compared with < and >. nodefld has to be initialized with RB_CLEAR_NODE(). Start using it for MDS, MON and OSD requests and OSD sessions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: open-code remove_{all,old}_osds()Ilya Dryomov
They are called only once, from ceph_osdc_stop() and handle_osds_timeout() respectively. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: nuke unused fields and functionsIlya Dryomov
Either unused or useless: osdmap->mkfs_epoch osd->o_marked_for_keepalive monc->num_generic_requests osdc->map_waiters osdc->last_requested_map osdc->timeout_tid osd_req_op_cls_response_data() osdmap_apply_incremental() @msgr arg Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: variable-sized ceph_object_idIlya Dryomov
Currently ceph_object_id can hold object names of up to 100 (CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases, expect one - long rbd image names: - a format 1 header is named "<imgname>.rbd" - an object that points to a format 2 header is named "rbd_id.<imgname>" We operate on these potentially long-named objects during rbd map, and, for format 1 images, during header refresh. (A format 2 header name is a small system-generated string.) Lift this 100 character limit by making ceph_object_id be able to point to an externally-allocated string. Apart from being able to work with almost arbitrarily-long named objects, this allows us to reduce the size of ceph_object_id from >100 bytes to 64 bytes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: change how osd_op_reply message size is calculatedIlya Dryomov
For a message pool message, preallocate a page, just like we do for osd_op. For a normal message, take ceph_object_id into account and don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: move message allocation out of ceph_osdc_alloc_request()Ilya Dryomov
The size of ->r_request and ->r_reply messages depends on the size of the object name (ceph_object_id), while the size of ceph_osd_request is fixed. Move message allocation into a separate function that would have to be called after ceph_object_id and ceph_object_locator (which is also going to become variable in size with RADOS namespaces) have been filled in: req = ceph_osdc_alloc_request(...); <fill in req->r_base_oid> <fill in req->r_base_oloc> ceph_osdc_alloc_messages(req); Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: grab snapc in ceph_osdc_alloc_request()Ilya Dryomov
ceph_osdc_build_request() is going away. Grab snapc and initialize ->r_snapid in ceph_osdc_alloc_request(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-25libceph: make ceph_osdc_put_request() accept NULLIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-14net/route: enforce hoplimit max valuePaolo Abeni
Currently, when creating or updating a route, no check is performed in both ipv4 and ipv6 code to the hoplimit value. The caller can i.e. set hoplimit to 256, and when such route will be used, packets will be sent with hoplimit/ttl equal to 0. This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4 ipv6 route code, substituting any value greater than 255 with 255. This is consistent with what is currently done for ADVMSS and MTU in the ipv4 code. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14nf_conntrack: avoid kernel pointer value leak in slab nameLinus Torvalds
The slab name ends up being visible in the directory structure under /sys, and even if you don't have access rights to the file you can see the filenames. Just use a 64-bit counter instead of the pointer to the 'net' structure to generate a unique name. This code will go away in 4.7 when the conntrack code moves to a single kmemcache, but this is the backportable simple solution to avoiding leaking kernel pointers to user space. Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11gre: do not keep the GRE header around in collect medata modeJiri Benc
For ipgre interface in collect metadata mode, it doesn't make sense for the interface to be of ARPHRD_IPGRE type. The outer header of received packets is not needed, as all the information from it is present in metadata_dst. We already don't set ipgre_header_ops for collect metadata interfaces, which is the only consumer of mac_header pointing to the outer IP header. Just set the interface type to ARPHRD_NONE in collect metadata mode for ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset mac_header. Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode") Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11openvswitch: Fix cached ct with helper.Joe Stringer
When using conntrack helpers from OVS, a common configuration is to perform a lookup without specifying a helper, then go through a firewalling policy, only to decide to attach a helper afterwards. In this case, the initial lookup will cause a ct entry to be attached to the skb, then the later commit with helper should attach the helper and confirm the connection. However, the helper attachment has been missing. If the user has enabled automatic helper attachment, then this issue will be masked as it will be applied in init_conntrack(). It is also masked if the action is executed from ovs_packet_cmd_execute() as that will construct a fresh skb. This patch fixes the issue by making an explicit call to try to assign the helper if there is a discrepancy between the action's helper and the current skb->nfct. Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: ife action fix late bindingJamal Hadi Salim
The process below was broken and is fixed with this patch. //add an ife action and give it an instance id of 1 sudo tc actions add action ife encode \ type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1 //create a filter which binds to ife action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: skbedit action fix late bindingJamal Hadi Salim
The process below was broken and is fixed with this patch. //add a skbedit action and give it an instance id of 1 sudo tc actions add action skbedit mark 10 index 1 //create a filter which binds to skbedit action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: simple action fix late bindingJamal Hadi Salim
The process below was broken and is fixed with this patch. //add a simple action and give it an instance id of 1 sudo tc actions add action simple sdata "foobar" index 1 //create a filter which binds to simple action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: mirred action fix late bindingJamal Hadi Salim
The process below was broken and is fixed with this patch. //add an mirred action and give it an instance id of 1 sudo tc actions add action mirred egress mirror dev $MDEV index 1 //create a filter which binds to mirred action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: ipt action fix late bindingJamal Hadi Salim
This was broken and is fixed with this patch. //add an ipt action and give it an instance id of 1 sudo tc actions add action ipt -j mark --set-mark 2 index 1 //create a filter which binds to ipt action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11net sched: vlan action fix late bindingJamal Hadi Salim
Late vlan action binding was broken and is fixed with this patch. //add a vlan action to pop and give it an instance id of 1 sudo tc actions add action vlan pop index 1 //create filter which binds to vlan action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32 \ match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1 current message(before bug fix) was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10tcp: refresh skb timestamp at retransmit timeEric Dumazet
In the very unlikely case __tcp_retransmit_skb() can not use the cloning done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing the copy and transmit, otherwise TCP TS val will be an exact copy of original transmit. Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain Netfilter simple fixes for your net tree, two one-liner and one two-liner: 1) Oneliner to fix missing spinlock definition that triggers 'BUG: spinlock bad magic on CPU#' when spinlock debugging is enabled, from Florian Westphal. 2) Fix missing workqueue cancelation on IDLETIMER removal, from Liping Zhang. 3) Fix insufficient validation of netlink of NFACCT_QUOTA in nfnetlink_acct, from Phil Turnbull. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10net: fix a kernel infoleak in x25 moduleKangjie Lu
Stack object "dte_facilities" is allocated in x25_rx_call_request(), which is supposed to be initialized in x25_negotiate_facilities. However, 5 fields (8 bytes in total) are not initialized. This object is then copied to userland via copy_to_user, thus infoleak occurs. Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_offload: Set encapsulation before inner completes.Jarno Rajahalme
UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_tunnel: Remove redundant udp_tunnel_gro_complete().Jarno Rajahalme
The setting of the UDP tunnel GSO type is already performed by udp[46]_gro_complete(). Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06net: ipv6: tcp reset, icmp need to consider L3 domainDavid Ahern
Responses for packets to unused ports are getting lost with L3 domains. IPv4 has ip_send_unicast_reply for sending TCP responses which accounts for L3 domains; update the IPv6 counterpart tcp_v6_send_response. For icmp the L3 master check needs to be moved up in icmp6_send to properly respond to UDP packets to a port with no listener. Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06bridge: fix igmp / mld query parsingLinus Lüssing
With the newly introduced helper functions the skb pulling is hidden in the checksumming function - and undone before returning to the caller. The IGMP and MLD query parsing functions in the bridge still assumed that the skb is pointing to the beginning of the IGMP/MLD message while it is now kept at the beginning of the IPv4/6 header. If there is a querier somewhere else, then this either causes the multicast snooping to stay disabled even though it could be enabled. Or, if we have the querier enabled too, then this can create unnecessary IGMP / MLD query messages on the link. Fixing this by taking the offset between IP and IGMP/MLD header into account, too. Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code") Reported-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06net: bridge: fix old ioctl unlocked net device walkNikolay Aleksandrov
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl calls which aren't called with rtnl held. The comment above says that it is called with rtnl but that is not really the case. Here's a sample output from a test ASSERT_RTNL() which I put in get_bridge_ifindices and executed "brctl show": [ 957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30) [ 957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G W O 4.6.0-rc4+ #157 [ 957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [ 957.423009] 0000000000000000 ffff880058adfdf0 ffffffff8138dec5 0000000000000400 [ 957.423009] ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32 0000000000000001 [ 957.423009] 00007ffec1a444b0 0000000000000400 ffff880053c19130 0000000000008940 [ 957.423009] Call Trace: [ 957.423009] [<ffffffff8138dec5>] dump_stack+0x85/0xc0 [ 957.423009] [<ffffffffa05ead32>] br_ioctl_deviceless_stub+0x212/0x2e0 [bridge] [ 957.423009] [<ffffffff81515beb>] sock_ioctl+0x22b/0x290 [ 957.423009] [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700 [ 957.423009] [<ffffffff8126c159>] SyS_ioctl+0x79/0x90 [ 957.423009] [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1 Since it only reads bridge ifindices, we can use rcu to safely walk the net device list. Also remove the wrong rtnl comment above. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06VSOCK: do not disconnect socket when peer has shutdown SEND onlyIan Campbell
The peer may be expecting a reply having sent a request and then done a shutdown(SHUT_WR), so tearing down the whole socket at this point seems wrong and breaks for me with a client which does a SHUT_WR. Looking at other socket family's stream_recvmsg callbacks doing a shutdown here does not seem to be the norm and removing it does not seem to have had any adverse effects that I can see. I'm using Stefan's RFC virtio transport patches, I'm unsure of the impact on the vmci transport. Signed-off-by: Ian Campbell <ian.campbell@docker.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Andy King <acking@vmware.com> Cc: Dmitry Torokhov <dtor@vmware.com> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Adit Ranadive <aditr@vmware.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-05netfilter: nfnetlink_acct: validate NFACCT_QUOTA parameterPhil Turnbull
If a quota bit is set in NFACCT_FLAGS but the NFACCT_QUOTA parameter is missing then a NULL pointer dereference is triggered. CAP_NET_ADMIN is required to trigger the bug. Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-05-04 1) The flowcache can hit an OOM condition if too many entries are in the gc_list. Fix this by counting the entries in the gc_list and refuse new allocations if the value is too high. 2) The inner headers are invalid after a xfrm transformation, so reset the skb encapsulation field to ensure nobody tries access the inner headers. Otherwise tunnel devices stacked on top of xfrm may build the outer headers based on wrong informations. 3) Add pmtu handling to vti, we need it to report pmtu informations for local generated packets. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04net: fix infoleak in rtnetlinkKangjie Lu
The stack object “map” has a total size of 32 bytes. Its last 4 bytes are padding generated by compiler. These padding bytes are not initialized and sent out via “nla_put”. Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04net: fix infoleak in llcKangjie Lu
The stack object “info” has a total size of 12 bytes. Its last byte is padding which is not initialized and leaked via “put_cmsg”. Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03ipv6/ila: fix nlsize calculation for lwtunnelNicolas Dichtel
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR. Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module") CC: Tom Herbert <tom@herbertland.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.Sowmini Varadhan
An arbitration scheme for duelling SYNs is implemented as part of commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") which ensures that both nodes involved will arrive at the same arbitration decision. However, this needs to be synchronized with an outgoing SYN to be generated by rds_tcp_conn_connect(). This commit achieves the synchronization through the t_conn_lock mutex in struct rds_tcp_connection. The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring the t_conn_lock mutex. A SYN is sent out only if the RDS connection is not already UP (an UP would indicate that rds_tcp_accept_one() has completed 3WH, so no SYN needs to be generated). Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after acquiring the t_conn_lock mutex. The only acceptable states (to allow continuation of the arbitration logic) are UP (i.e., outgoing SYN was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent outgoing SYN before we saw incoming SYN). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sockSowmini Varadhan
There is a race condition between rds_send_xmit -> rds_tcp_xmit and the code that deals with resolution of duelling syns added by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()"). Specifically, we may end up derefencing a null pointer in rds_send_xmit if we have the interleaving sequence: rds_tcp_accept_one rds_send_xmit conn is RDS_CONN_UP, so invoke rds_tcp_xmit tc = conn->c_transport_data rds_tcp_restore_callbacks /* reset t_sock */ null ptr deref from tc->t_sock The race condition can be avoided without adding the overhead of additional locking in the xmit path: have rds_tcp_accept_one wait for rds_tcp_xmit threads to complete before resetting callbacks. The synchronization can be done in the same manner as rds_conn_shutdown(). First set the rds_conn_state to something other than RDS_CONN_UP (so that new threads cannot get into rds_tcp_xmit()), then wait for RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any threads in rds_tcp_xmit are done. Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03net: Disable segmentation if checksumming is not supportedAlexander Duyck
In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum offload for tunnels. With this being the case we should disable GSO in addition to the checksum offload features when we find that a device cannot perform a checksum on a given packet type. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03netem: Segment GSO packets on enqueueNeil Horman
This was recently reported to me, and reproduced on the latest net kernel, when attempting to run netperf from a host that had a netem qdisc attached to the egress interface: [ 788.073771] ---------------------[ cut here ]--------------------------- [ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda() [ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962 data_len=0 gso_size=1448 gso_type=1 ip_summed=3 [ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log dm_mod [ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W ------------ 3.10.0-327.el7.x86_64 #1 [ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012 [ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670 ffffffff816351f1 [ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200 ffff880231674000 [ 788.611943] 0000000000000001 0000000000000003 0000000000000000 ffff880437c03710 [ 788.647241] Call Trace: [ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b [ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0 [ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80 [ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100 [ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda [ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190 [ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem] [ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570 [ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0 ... The problem occurs because netem is not prepared to handle GSO packets (as it uses skb_checksum_help in its enqueue path, which cannot manipulate these frames). The solution I think is to simply segment the skb in a simmilar fashion to the way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes. When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt the first segment, and enqueue the remaining ones. tested successfully by myself on the latest net kernel, to which this applies Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: "David S. Miller" <davem@davemloft.net> CC: netem@lists.linux-foundation.org CC: eric.dumazet@gmail.com CC: stephen@networkplumber.org Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Antonio Quartulli says: ==================== In this small batch of patches you have: - a fix for our Distributed ARP Table that makes sure that the input provided to the hash function during a query is the same as the one provided during an insert (so to prevent false negatives), by Antonio Quartulli - a fix for our new protocol implementation B.A.T.M.A.N. V that ensures that a hard interface is properly re-activated when it is brought down and then up again, by Antonio Quartulli - two fixes respectively to the reference counting of the tt_local_entry and neigh_node objects, by Sven Eckelmann. Such bug is rather severe as it would prevent the netdev objects references by batman-adv from being released after shutdown. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips, from Sara Sharon. 2) Fix SKB size checks in batman-adv stack on receive, from Sven Eckelmann. 3) Leak fix on mac80211 interface add error paths, from Johannes Berg. 4) Cannot invoke napi_disable() with BH disabled in myri10ge driver, fix from Stanislaw Gruszka. 5) Fix sign extension problem when computing feature masks in net_gso_ok(), from Marcelo Ricardo Leitner. 6) lan78xx driver doesn't count packets and packet lengths in its statistics properly, fix from Woojung Huh. 7) Fix the buffer allocation sizes in pegasus USB driver, from Petko Manolov. 8) Fix refcount overflows in bpf, from Alexei Starovoitov. 9) Unified dst cache handling introduced a preempt warning in ip_tunnel, fix by resetting rather then setting the cached route. From Paolo Abeni. 10) Listener hash collision test fix in soreuseport, from Craig Gallak * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) gre: do not pull header in ICMP error processing net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case tipc: only process unicast on intended node cxgb3: fix out of bounds read net/smscx5xx: use the device tree for mac address soreuseport: Fix TCP listener hash collision net: l2tp: fix reversed udp6 checksum flags ip_tunnel: fix preempt warning in ip tunnel creation/updating samples/bpf: fix trace_output example bpf: fix check_map_func_compatibility logic bpf: fix refcnt overflow drivers: net: cpsw: use of_phy_connect() in fixed-link case dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive drivers: net: cpsw: don't ignore phy-mode if phy-handle is used drivers: net: cpsw: fix segfault in case of bad phy-handle drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver gre: reject GUE and FOU in collect metadata mode pegasus: fixes reported packet length pegasus: fixes URB buffer allocation size; ...
2016-05-02gre: do not pull header in ICMP error processingJiri Benc
iptunnel_pull_header expects that IP header was already pulled; with this expectation, it pulls the tunnel header. This is not true in gre_err. Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points to the IP header. We cannot pull the tunnel header in this path. It's just a matter of not calling iptunnel_pull_header - we don't need any of its effects. Fixes: bda7bb463436 ("gre: Allow multiple protocol listener for gre protocol.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02tipc: only process unicast on intended nodeHamish Martin
We have observed complete lock up of broadcast-link transmission due to unacknowledged packets never being removed from the 'transmq' queue. This is traced to nodes having their ack field set beyond the sequence number of packets that have actually been transmitted to them. Consider an example where node 1 has sent 10 packets to node 2 on a link and node 3 has sent 20 packets to node 2 on another link. We see examples of an ack from node 2 destined for node 3 being treated as an ack from node 2 at node 1. This leads to the ack on the node 1 to node 2 link being increased to 20 even though we have only sent 10 packets. When node 1 does get around to sending further packets, none of the packets with sequence numbers less than 21 are actually removed from the transmq. To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc: reduce locking scope during packet reception") which ensures that only messages destined for the receiving node are processed by that node. This prevents the sequence numbers from getting out of sync and resolves the packet leakage, thereby resolving the broadcast-link transmission lock-ups we observed. While we are aware that this change only patches over a root problem that we still haven't identified, this is a sanity test that it is always legitimate to do. It will remain in the code even after we identify and fix the real problem. Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01soreuseport: Fix TCP listener hash collisionCraig Gallek
I forgot to include a check for listener port equality when deciding if two sockets should belong to the same reuseport group. This was not caught previously because it's only necessary when two listening sockets for the same user happen to hash to the same listener bucket. The same error does not exist in the UDP path. Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection") Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01net: l2tp: fix reversed udp6 checksum flagsWang Shanker
This patch fixes a bug which causes the behavior of whether to ignore udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what userspace program requests. When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is expected that udp6 checksums of received packets of the l2tp tunnel to create should be ignored. In `l2tp_netlink.c`: `l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set according to the flag, and then passed to `l2tp_core.c`: `l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In `l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set the same to `cfg.udp6_zero_rx_checksums`. However, if we want the checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set to `false`, i.e. be set to the contrary. Similarly, the same should be done to `udp_conf.use_udp6_tx_checksums`. Signed-off-by: Miao Wang <shankerwangmiao@gmail.com> Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>