summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2013-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22net: sctp: minor: remove dead code from sctp_packetDaniel Borkmann
struct sctp_packet is currently embedded into sctp_transport or sits on the stack as 'singleton' in sctp_outq_flush(). Therefore, its member 'malloced' is always 0, thus a kfree() is never called. Because of that, we can just remove this code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22net: remove a stale comment for dl_nextEric Dumazet
dl_next member in struct request_sock doesn't need to be first. We expect to insert a "struct common_sock" or a subset of it, so this claim had to be verified. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-20net: fix incorrect credentials passingLinus Torvalds
Commit 257b5358b32f ("scm: Capture the full credentials of the scm sender") changed the credentials passing code to pass in the effective uid/gid instead of the real uid/gid. Obviously this doesn't matter most of the time (since normally they are the same), but it results in differences for suid binaries when the wrong uid/gid ends up being used. This just undoes that (presumably unintentional) part of the commit. Reported-by: Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: David S. Miller <davem@davemloft.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19irda: small read past the end of array in debug codeDan Carpenter
The "reason" can come from skb->data[] and it hasn't been capped so it can be from 0-255 instead of just 0-6. For example in irlmp_state_dtr() the code does: reason = skb->data[3]; ... irlmp_disconnect_indication(self, reason, skb); Also LMREASON has a couple other values which don't have entries in the irlmp_reasons[] array. And 0xff is a valid reason as well which means "unknown". So far as I can see we don't actually care about "reason" except for in the debug code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netfilter: rename netlink related "pid" variables to "portid"Patrick McHardy
Get rid of the confusing mix of pid and portid and use portid consistently for all netlink related socket identities. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_ulpq: remove 'malloced' struct memberDaniel Borkmann
The structure sctp_ulpq is embedded into sctp_association and never separately allocated, also ulpq->malloced is always 0, so that kfree() is never called. Therefore, remove this code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_bind_addr: remove dead codeDaniel Borkmann
The sctp_bind_addr structure has a 'malloced' member that is always set to 0, thus in sctp_bind_addr_free() the kfree() part can never be called. This part is embedded into sctp_ep_common anyway and never alloced. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_transport: remove unused variableDaniel Borkmann
sctp_transport's member 'malloced' is set to 1, never evaluated and the structure is kfreed anyway. So just remove it. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_outq: remove 'malloced' from its structDaniel Borkmann
sctp_outq is embedded into sctp_association, and thus never kmalloced in any way. Also, malloced is always 0, thus kfree() is never called. Therefore, remove that dead piece of code. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_inq: remove dead codeDaniel Borkmann
sctp_inq is never kmalloced, since it's integrated into sctp_ep_common and only initialized from eps and assocs. Therefore, remove the dead code from there. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-17net: sctp: sctp_ssnmap: remove 'malloced' element from structDaniel Borkmann
sctp_ssnmap_init() can only be called from sctp_ssnmap_new() where malloced is always set to 1. Thus, when we call sctp_ssnmap_free() the test for map->malloced evaluates always to true. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-15net: sctp: minor: make sctp_ep_common's member 'dead' a boolDaniel Borkmann
Since dead only holds two states (0,1), make it a bool instead of a 'char', which is more appropriate for its purpose. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-15net: sctp: remove sctp_ep_common struct member 'malloced'Daniel Borkmann
There is actually no need to keep this member in the structure, because after init it's always 1 anyway, thus always kfree called. This seems to be an ancient leftover from the very initial implementation from 2.5 times. Only in case the initialization of an association fails, we leave base.malloced as 0, but we nevertheless kfree it in the error path in sctp_association_new(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14net: sock: make sock_tx_timestamp voidDaniel Borkmann
Currently, sock_tx_timestamp() always returns 0. The comment that describes the sock_tx_timestamp() function wrongly says that it returns an error when an invalid argument is passed (from commit 20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING''). Make the function void, so that we can also remove all the unneeded if conditions that check for such a _non-existant_ error case in the output path. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14ipv6: statically link register_inet6addr_notifier()Cong Wang
Tomas reported the following build error: net/built-in.o: In function `ieee80211_unregister_hw': (.text+0x10f0e1): undefined reference to `unregister_inet6addr_notifier' net/built-in.o: In function `ieee80211_register_hw': (.text+0x10f610): undefined reference to `register_inet6addr_notifier' make: *** [vmlinux] Error 1 when built IPv6 as a module. So we have to statically link these symbols. Reported-by: Tomas Melin <tomas.melin@iki.fi> Cc: Tomas Melin <tomas.melin@iki.fi> Cc: "David S. Miller" <davem@davemloft.net> Cc: YOSHIFUJI Hidaki <yoshfuji@linux-ipv6.org> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-12tcp: GSO should be TSQ friendlyEric Dumazet
I noticed that TSQ (TCP Small queues) was less effective when TSO is turned off, and GSO is on. If BQL is not enabled, TSQ has then no effect. It turns out the GSO engine frees the original gso_skb at the time the fragments are generated and queued to the NIC. We should instead call the tcp_wfree() destructor for the last fragment, to keep the flow control as intended in TSQ. This effectively limits the number of queued packets on qdisc + NIC layers. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-11Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Allow to avoid copying DSCP during encapsulation by setting a SA flag. From Nicolas Dichtel. 2) Constify the netlink dispatch table, no need to modify it at runtime. From Mathias Krause. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-10Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: drivers/net/wireless/rt2x00/rt2x00pci.c net/mac80211/sta_info.c net/wireless/core.h
2013-04-09net: sctp: introduce uapi header for sctpDaniel Borkmann
This patch introduces an UAPI header for the SCTP protocol, so that we can facilitate the maintenance and development of user land applications or libraries, in particular in terms of header synchronization. To not break compatibility, some fragments from lksctp-tools' netinet/sctp.h have been carefully included, while taking care that neither kernel nor user land breaks, so both compile fine with this change (for lksctp-tools I tested with the old netinet/sctp.h header and with a newly adapted one that includes the uapi sctp header). lksctp-tools smoke test run through successfully as well in both cases. Suggested-by: Neil Horman <nhorman@tuxdriver.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09netprio_cgroup: remove task_struct parameter from sock_update_netprio()Zefan Li
The callers always pass current to sock_update_netprio(). Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09cls_cgroup: remove task_struct parameter from sock_update_classid()Zefan Li
The callers always pass current to sock_update_classid(). Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09net: ipv6: only invalidate previously tokenized addressesDaniel Borkmann
Instead of invalidating all IPv6 addresses with global scope when one decides to use IPv6 tokens, we should only invalidate previous tokens and leave the rest intact until they expire eventually (or are intact forever). For doing this less greedy approach, we're adding a bool at the end of inet6_ifaddr structure instead, for two reasons: i) per-inet6_ifaddr flag space is already used up, making it wider might not be a good idea, since ii) also we do not necessarily need to export this information into user space. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08af_iucv: fix recvmsg by replacing skb_pull() functionUrsula Braun
When receiving data messages, the "BUG_ON(skb->len < skb->data_len)" in the skb_pull() function triggers a kernel panic. Replace the skb_pull logic by a per skb offset as advised by Eric Dumazet. Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08net: ipv6: add tokenized interface identifier supportDaniel Borkmann
This patch adds support for IPv6 tokenized IIDs, that allow for administrators to assign well-known host-part addresses to nodes whilst still obtaining global network prefix from Router Advertisements. It is currently in draft status. The primary target for such support is server platforms where addresses are usually manually configured, rather than using DHCPv6 or SLAAC. By using tokenised identifiers, hosts can still determine their network prefix by use of SLAAC, but more readily be automatically renumbered should their network prefix change. [...] The disadvantage with static addresses is that they are likely to require manual editing should the network prefix in use change. If instead there were a method to only manually configure the static identifier part of the IPv6 address, then the address could be automatically updated when a new prefix was introduced, as described in [RFC4192] for example. In such cases a DNS server might be configured with such a tokenised interface identifier of ::53, and SLAAC would use the token in constructing the interface address, using the advertised prefix. [...] http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02 The implementation is partially based on top of Mark K. Thompson's proof of concept. However, it uses the Netlink interface for configuration resp. data retrival, so that it can be easily extended in future. Successfully tested by myself. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08ieee802154/nl-mac.c: make some MLME operations optionalWerner Almesberger
Check for NULL before calling the following operations from "struct ieee802154_mlme_ops": assoc_req, assoc_resp, disassoc_req, start_req, and scan_req. This fixes a current oops where those functions are called but not implemented. It also updates the documentation to clarify that they are now optional by design. If a call to an unimplemented function is attempted, the kernel returns EOPNOTSUPP via netlink. The following operations are still required: get_phy, get_pan_id, get_short_addr, and get_dsn. Note that the places where this patch changes the initialization of "ret" should not affect the rest of the code since "ret" was always set (again) before returning its value. Signed-off-by: Werner Almesberger <werner@almesberger.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08IEEE 802.15.4: remove get_bsn from "struct ieee802154_mlme_ops"Werner Almesberger
It served no purpose: we never call it from anywhere in the stack and the only driver that did implement it (fakehard) merely provided a dummy value. There is also considerable doubt whether it would make sense to even attempt beacon processing at this level in the Linux kernel. Signed-off-by: Werner Almesberger <werner@almesberger.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07scm: Stop passing struct credEric W. Biederman
Now that uids and gids are completely encapsulated in kuid_t and kgid_t we no longer need to pass struct cred which allowed us to test both the uid and the user namespace for equality. Passing struct cred potentially allows us to pass the entire group list as BSD does but I don't believe the cost of cache line misses justifies retaining code for a future potential application. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07Merge branch 'master' of git://1984.lsi.us.es/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter and IPVS updates for your net-next tree, most relevantly they are: * Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE. The LOG and ebt_log target has been also adapted, but they still depend on the syslog netnamespace that seems to be missing, from Gao Feng. * Don't lose indications of congestion in IPv6 fragmentation handling, from Hannes Frederic Sowa.i * IPVS conversion to use RCU, including some code consolidation patches and optimizations, also some from Julian Anastasov. * cpu fanout support for NFQUEUE, from Holger Eitzenberger. * Better error reporting to userspace when dropping packets from all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-05netfilter: nf_log: prepare net namespace support for loggersGao feng
This patch adds netns support to nf_log and it prepares netns support for existing loggers. It is composed of four major changes. 1) nf_log_register has been split to two functions: nf_log_register and nf_log_set. The new nf_log_register is used to globally register the nf_logger and nf_log_set is used for enabling pernet support from nf_loggers. Per netns is not yet complete after this patch, it comes in separate follow up patches. 2) Add net as a parameter of nf_log_bind_pf. Per netns is not yet complete after this patch, it only allows to bind the nf_logger to the protocol family from init_net and it skips other cases. 3) Adapt all nf_log_packet callers to pass netns as parameter. After this patch, this function only works for init_net. 4) Make the sysctl net/netfilter/nf_log pernet. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-05netfilter: make /proc/net/netfilter pernetGao feng
This patch makes this proc dentry pernet. So far only init_net had a /proc/net/netfilter directory. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-04net: frag queue per hash bucket lockingJesper Dangaard Brouer
This patch implements per hash bucket locking for the frag queue hash. This removes two write locks, and the only remaining write lock is for protecting hash rebuild. This essentially reduce the readers-writer lock to a rebuild lock. This patch is part of "net: frag performance followup" http://thread.gmane.org/gmane.linux.network/263644 of which two patches have already been accepted: Same test setup as previous: (http://thread.gmane.org/gmane.linux.network/257155) Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses Ethernet flow-control. A third interface is used for generating the DoS attack (with trafgen). Notice, I have changed the frag DoS generator script to be more efficient/deadly. Before it would only hit one RX queue, now its sending packets causing multi-queue RX, due to "better" RX hashing. Test types summary (netperf UDP_STREAM): Test-20G64K == 2x10G with 65K fragments Test-20G3F == 2x10G with 3x fragments (3*1472 bytes) Test-20G64K+DoS == Same as 20G64K with frag DoS Test-20G3F+DoS == Same as 20G3F with frag DoS Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS When I rebased this-patch(03) (on top of net-next commit a210576c) and removed the _bh spinlock, I saw a performance regression. BUT this was caused by some unrelated change in-between. See tests below. Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de. Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02. Test (C) is what I reported before for this-patch Test (D) is net-next master HEAD (commit a210576c), which reveals some (unknown) performance regression (compared against test (B)). Test (D) function as a new base-test. Performance table summary (in Mbit/s): (#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ ---------- ------- ------- ---------- --------- -------- ------- (A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2 (B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2 (C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6 (D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2 (E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6 (F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3 Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks. I cannot explain the slow down for 20G64K (but its an artificial "lab-test" so I'm not worried). But the other results does show improvements. And test (E) "with _bh" version is slightly better. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> ---- V2: - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't need the spinlock _bh versions, as Netfilter currently does a local_bh_disable() before entering inet_fragment. - Fold-in desc from cover-mail V3: - Drop the chain_len counter per hash bucket. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-02tcp: Remove dead sysctl_tcp_cookie_size declarationNeal Cardwell
Remove a declaration left over from the TCPCT-ectomy. This sysctl is no longer referenced anywhere since 1a2c6181c4 ("tcp: Remove TCPCT"). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-01ipvs: convert services to rcuJulian Anastasov
This is the final step in RCU conversion. Things that are removed: - svc->usecnt: now svc is accessed under RCU read lock - svc->inc: and some unused code - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE - __ip_vs_svc_lock: replaced with RCU - IP_VS_WAIT_WHILE: now readers lookup svcs and dests under RCU and work in parallel with configuration Other changes: - before now, a RCU read-side critical section included the calling of the schedule method, now it is extended to include service lookup - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist - svc->pe and svc->scheduler remain to the end (of grace period), the schedulers are prepared for such RCU readers even after done_service is called but they need to use synchronize_rcu because last ip_vs_scheduler_put can happen while RCU read-side critical sections use an outdated svc->scheduler pointer - as planned, update_service is removed - empty services can be freed immediately after grace period. If dests were present, the services are freed from the dest trash code Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: convert dests to rcuJulian Anastasov
In previous commits the schedulers started to access svc->destinations with _rcu list traversal primitives because the IP_VS_WAIT_WHILE macro still plays the role of grace period. Now it is time to finish the updating part, i.e. adding and deleting of dests with _rcu suffix before removing the IP_VS_WAIT_WHILE in next commit. We use the same rule for conns as for the schedulers: dests can be searched in RCU read-side critical section where ip_vs_dest_hold can be called by ip_vs_bind_dest. Some things are not perfect, for example, calling functions like ip_vs_lookup_dest from updating code under RCU, just because we use some function both from reader and from updater. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: convert sched_lock to spin lockJulian Anastasov
As all read_locks are gone spin lock is preferred. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: do not expect result from done_serviceJulian Anastasov
This method releases the scheduler state, it can not fail. Such change will help to properly replace the scheduler in following patch. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: reorganize dest trashJulian Anastasov
All dests will go to trash, no exceptions. But we have to use new list node t_list for this, due to RCU changes in following patches. Dests will wait there initial grace period and later all conns and schedulers to put their reference. The dests don't get reference for staying in dest trash as before. As result, we do not load ip_vs_dest_put with extra checks for last refcnt and the schedulers do not need to play games with atomic_inc_not_zero while selecting best destination. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: add ip_vs_dest_hold and ip_vs_dest_putJulian Anastasov
ip_vs_dest_hold will be used under RCU lock while ip_vs_dest_put can be called even after dest is removed from service, as it happens for conns and some schedulers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: preparations for using rcu in schedulersJulian Anastasov
Allow schedulers to use rcu_dereference when returning destination on lookup. The RCU read-side critical section will allow ip_vs_bind_dest to get dest refcnt as preparation for the step where destinations will be deleted without an IP_VS_WAIT_WHILE guard that holds the packet processing during update. Add new optional scheduler methods add_dest, del_dest and upd_dest. For now the methods are called together with update_service but update_service will be removed in a following change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: avoid kmem_cache_zalloc in ip_vs_conn_newJulian Anastasov
We have many fields to set and few to reset, use kmem_cache_alloc instead to save some cycles. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: reorder keys in connection structureJulian Anastasov
__ip_vs_conn_in_get and ip_vs_conn_out_get are hot places. Optimize them, so that ports are matched first. By moving net and fwmark below, on 32-bit arch we can fit caddr in 32-byte cache line and all addresses in 64-byte cache line. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: convert connection lockingJulian Anastasov
Convert __ip_vs_conntbl_lock_array as follows: - readers that do not modify conn lists will use RCU lock - updaters that modify lists will use spinlock_t Now for conn lookups we will use RCU read-side critical section. Without using __ip_vs_conn_get such places have access to connection fields and can dereference some pointers like pe and pe_data plus the ability to update timer expiration. If full access is required we contend for reference. We add barrier in __ip_vs_conn_put, so that other CPUs see the refcnt operation after other writes. With the introduction of ip_vs_conn_unlink() we try to reorganize ip_vs_conn_expire(), so that unhashing of connections that should stay more time is avoided, even if it is for very short time. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: remove rs_lock by using RCUJulian Anastasov
rs_lock was used to protect rs_table (hash table) from updaters (under global mutex) and readers (packet handlers). We can remove rs_lock by using RCU lock for readers. Reclaiming dest only with kfree_rcu is enough because the readers access only fields from the ip_vs_dest structure. Use hlist for rs_table. As we are now using hlist_del_rcu, introduce in_rs_table flag as replacement for the list_empty checks which do not work with RCU. It is needed because only NAT dests are in the rs_table. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: convert app locksJulian Anastasov
We use locks like tcp_app_lock, udp_app_lock, sctp_app_lock to protect access to the protocol hash tables from readers in packet context while the application instances (inc) are [un]registered under global mutex. As the hash tables are mostly read when conns are created and bound to app, use RCU for readers and reclaim app instance after grace period. Simplify ip_vs_app_inc_get because we use usecnt only for statistics and rely on module refcounting. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: optimize dst usage for real serverJulian Anastasov
Currently when forwarding requests to real servers we use dst_lock and atomic operations when cloning the dst_cache value. As the dst_cache value does not change most of the time it is better to use RCU and to lock dst_lock only when we need to replace the obsoleted dst. For this to work we keep dst_cache in new structure protected by RCU. For packets to remote real servers we will use noref version of dst_cache, it will be valid while we are in RCU read-side critical section because now dst_release for replaced dsts will be invoked after the grace period. Packets to local real servers that are passed to local stack with NF_ACCEPT need a dst clone. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: rename functions related to dst_cache resetJulian Anastasov
Move and give better names to two functions: - ip_vs_dst_reset to __ip_vs_dst_cache_reset - __ip_vs_dev_reset to ip_vs_forget_dev Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-04-01ipvs: avoid routing by TOS for real serverJulian Anastasov
Avoid replacing the cached route for real server on every packet with different TOS. I doubt that routing by TOS for real server is used at all, so we should be better with such optimization. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-03-31net: add option to enable error queue packets waking selectKeller, Jacob E
Currently, when a socket receives something on the error queue it only wakes up the socket on select if it is in the "read" list, that is the socket has something to read. It is useful also to wake the socket if it is in the error list, which would enable software to wait on error queue packets without waking up for regular data on the socket. The main use case is for receiving timestamped transmit packets which return the timestamp to the socket via the error queue. This enables an application to select on the socket for the error queue only instead of for the regular traffic. -v2- * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file * Modified every socket poll function that checks error queue Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Matthew Vick <matthew.vick@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27net: use the frag lru_lock to protect netns_frags.nqueues updateJesper Dangaard Brouer
Move the protection of netns_frags.nqueues updates under the LRU_lock, instead of the write lock. As they are located on the same cacheline, and this is also needed when transitioning to use per hash bucket locking. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>