summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2010-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits) drivers/net/usb/asix.c: Fix pointer cast. be2net: Bug fix to avoid disabling bottom half during firmware upgrade. proc_dointvec: write a single value hso: add support for new products Phonet: fix potential use-after-free in pep_sock_close() ath9k: remove VEOL support for ad-hoc ath9k: change beacon allocation to prefer the first beacon slot sock.h: fix kernel-doc warning cls_cgroup: Fix build error when built-in macvlan: do proper cleanup in macvlan_common_newlink() V2 be2net: Bug fix in init code in probe net/dccp: expansion of error code size ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep wireless: fix sta_info.h kernel-doc warnings wireless: fix mac80211.h kernel-doc warnings iwlwifi: testing the wrong variable in iwl_add_bssid_station() ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs() ath9k_htc: dereferencing before check in hif_usb_tx_cb() rt2x00: Fix rt2800usb TX descriptor writing. rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions. ...
2010-05-24tun: Update classid on packet injectionHerbert Xu
This patch makes tun update its socket classid every time we inject a packet into the network stack. This is so that any updates made by the admin to the process writing packets to tun is effected. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24cls_cgroup: Store classid in struct sockHerbert Xu
Up until now cls_cgroup has relied on fetching the classid out of the current executing thread. This runs into trouble when a packet processing is delayed in which case it may execute out of another thread's context. Furthermore, even when a packet is not delayed we may fail to classify it if soft IRQs have been disabled, because this scenario is indistinguishable from one where a packet unrelated to the current thread is processed by a real soft IRQ. In fact, the current semantics is inherently broken, as a single skb may be constructed out of the writes of two different tasks. A different manifestation of this problem is when the TCP stack transmits in response of an incoming ACK. This is currently unclassified. As we already have a concept of packet ownership for accounting purposes in the skb->sk pointer, this is a natural place to store the classid in a persistent manner. This patch adds the cls_cgroup classid in struct sock, filling up an existing hole on 64-bit :) The value is set at socket creation time. So all sockets created via socket(2) automatically gains the ID of the thread creating it. Whenever another process touches the socket by either reading or writing to it, we will change the socket classid to that of the process if it has a valid (non-zero) classid. For sockets created on inbound connections through accept(2), we inherit the classid of the original listening socket through sk_clone, possibly preceding the actual accept(2) call. In order to minimise risks, I have not made this the authoritative classid. For now it is only used as a backup when we execute with soft IRQs disabled. Once we're completely happy with its semantics we can use it as the sole classid. Footnote: I have rearranged the error path on cls_group module creation. If we didn't do this, then there is a window where someone could create a tc rule using cls_group before the cgroup subsystem has been registered. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24net-2.6 : V2 - fix dev_get_valid_nameDaniel Lezcano
the commit: commit d90310243fd750240755e217c5faa13e24f41536 Author: Octavian Purdila <opurdila@ixiacom.com> Date: Wed Nov 18 02:36:59 2009 +0000 net: device name allocation cleanups introduced a bug when there is a hash collision making impossible to rename a device with eth%d. This bug is very hard to reproduce and appears rarely. The problem is coming from we don't pass a temporary buffer to __dev_alloc_name but 'dev->name' which is modified by the function. A detailed explanation is here: http://marc.info/?l=linux-netdev&m=127417784011987&w=2 Changelog: V2 : replaced strings comparison by pointers comparison Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Reviewed-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24rtnetlink: Fix error handling in do_setlink()David Howells
Commit c02db8c6290bb992442fec1407643c94cc414375: Author: Chris Wright <chrisw@sous-sol.org> Date: Sun May 16 01:05:45 2010 -0700 Subject: rtnetlink: make SR-IOV VF interface symmetric adds broken error handling to do_setlink() in net/core/rtnetlink.c. The problem is the following chunk of code: if (tb[IFLA_VFINFO_LIST]) { struct nlattr *attr; int rem; nla_for_each_nested(attr, tb[IFLA_VFINFO_LIST], rem) { if (nla_type(attr) != IFLA_VF_INFO) ----> goto errout; err = do_setvfinfo(dev, attr); if (err < 0) goto errout; modified = 1; } } which can get to errout without setting err, resulting in the following error: net/core/rtnetlink.c: In function 'do_setlink': net/core/rtnetlink.c:904: warning: 'err' may be used uninitialized in this function Change the code to return -EINVAL in this case. Note that this might not be the appropriate error though. Signed-off-by: David Howells <dhowells@redhat.com> cc: Chris Wright <chrisw@sous-sol.org> cc: David S. Miller <davem@davemloft.net> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-21Merge branch 'master' into for-2.6.35Jens Axboe
Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21pipe: add support for shrinking and growing pipesJens Axboe
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for growing and shrinking the size of a pipe and adjusts pipe.c and splice.c (and relay and network splice) usage to work with these larger (or smaller) pipes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21net: Expose all network devices in a namespaces in sysfsEric W. Biederman
This reverts commit aaf8cdc34ddba08122f02217d9d684e2f9f5d575. Drivers like the ipw2100 call device_create_group when they are initialized and device_remove_group when they are shutdown. Moving them between namespaces deletes their sysfs groups early. In particular the following call chain results. netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir With sysfs_remove_dir recursively deleting all of it's subdirectories, and nothing adding them back. Ouch! Therefore we need to call something that ultimate calls sysfs_mv_dir as that sysfs function can move sysfs directories between namespaces without deleting their subdirectories or their contents. Allowing us to avoid placing extra boiler plate into every driver that does something interesting with sysfs. Currently the function that provides that capability is device_rename. That is the code works without nasty side effects as originally written. So remove the misguided fix for moving devices between namespaces. The bug in the kobject layer that inspired it has now been recognized and fixed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21net/sysfs: Fix the bitrot in network device kobject namespace supportEric W. Biederman
I had a couple of stupid bugs in: netns: Teach network device kobjects which namespace they are in. - I duplicated the Kconfig for the NET_NS - The build was broken when sysfs was not compiled in The sysfs breakage is because after I moved the operations for the sysfs to the kobject layer, to make things cleaner I forgot to move the ifdefs. Opps. I'm not quite certain how I got introduced a second NET_NS Kconfig, but it was probably a 3 way merge somewhere along the way that did not notice that the NET_NS Kconfig option had mvoed and thout that was a bug. It probably slipped in because it used to be the sysfs patches were the first patches in my network namespace patches. Some things just don't go like you would expect. Neither of these bugs actually affect anything in the common case but they should be fixed. Thanks to Serge for noticing they were present. Reported-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Acked-by: David S. Miller <davem@davemloft.net>
2010-05-21netns: Teach network device kobjects which namespace they are in.Eric W. Biederman
The problem. Network devices show up in sysfs and with the network namespace active multiple devices with the same name can show up in the same directory, ouch! To avoid that problem and allow existing applications in network namespaces to see the same interface that is currently presented in sysfs, this patch enables the tagging directory support in sysfs. By using the network namespace pointers as tags to separate out the the sysfs directory entries we ensure that we don't have conflicts in the directories and applications only see a limited set of the network devices. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21net: fix problem in dequeuing from input_pkt_queueTom Herbert
Fix some issues introduced in batch skb dequeuing for input_pkt_queue. The primary issue it that the queue head must be incremented only after a packet has been processed, that is only after __netif_receive_skb has been called. This is needed for the mechanism to prevent OOO packet in RFS. Also when flushing the input_pkt_queue and process_queue, the process queue should be done first to prevent OOO packets. Because the input_pkt_queue has been effectively split into two queues, the calculation of the tail ptr is no longer correct. The correct value would be head+input_pkt_queue->len+process_queue->len. To avoid this calculation we added an explict input_queue_tail in softnet_data. The tail value is simply incremented when queuing to input_pkt_queue. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-21gro: Fix bogus gso_size on the first fraglist entryHerbert Xu
When GRO produces fraglist entries, and the resulting skb hits an interface that is incapable of TSO but capable of FRAGLIST, we end up producing a bogus packet with gso_size non-zero. This was reported in the field with older versions of KVM that did not set the TSO bits on tuntap. This patch fixes that. Reported-by: Igor Zhang <yugzhang@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-18net: Add netlink support for virtual port management (was iovnl)Scott Feldman
Add new netdev ops ndo_{set|get}_vf_port to allow setting of port-profile on a netdev interface. Extends netlink socket RTM_SETLINK/ RTM_GETLINK with two new sub msgs called IFLA_VF_PORTS and IFLA_PORT_SELF (added to end of IFLA_cmd list). These are both nested atrtibutes using this layout: [IFLA_NUM_VF] [IFLA_VF_PORTS] [IFLA_VF_PORT] [IFLA_PORT_*], ... [IFLA_VF_PORT] [IFLA_PORT_*], ... ... [IFLA_PORT_SELF] [IFLA_PORT_*], ... These attributes are design to be set and get symmetrically. VF_PORTS is a list of VF_PORTs, one for each VF, when dealing with an SR-IOV device. PORT_SELF is for the PF of the SR-IOV device, in case it wants to also have a port-profile, or for the case where the VF==PF, like in enic patch 2/2 of this patch set. A port-profile is used to configure/enable the external switch virtual port backing the netdev interface, not to configure the host-facing side of the netdev. A port-profile is an identifier known to the switch. How port- profiles are installed on the switch or how available port-profiles are made know to the host is outside the scope of this patch. There are two types of port-profiles specs in the netlink msg. The first spec is for 802.1Qbg (pre-)standard, VDP protocol. The second spec is for devices that run a similar protocol as VDP but in firmware, thus hiding the protocol details. In either case, the specs have much in common and makes sense to define the netlink msg as the union of the two specs. For example, both specs have a notition of associating/deassociating a port-profile. And both specs require some information from the hypervisor manager, such as client port instance ID. The general flow is the port-profile is applied to a host netdev interface using RTM_SETLINK, the receiver of the RTM_SETLINK msg communicates with the switch, and the switch virtual port backing the host netdev interface is configured/enabled based on the settings defined by the port-profile. What those settings comprise, and how those settings are managed is again outside the scope of this patch, since this patch only deals with the first step in the flow. Signed-off-by: Scott Feldman <scofeldm@cisco.com> Signed-off-by: Roopa Prabhu <roprabhu@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-18net: Remove unnecessary semicolons after switch statementsJoe Perches
Also added an explicit break; to avoid a fallthrough in net/ipv4/tcp_input.c Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-18net: add a noref bit on skb dstEric Dumazet
Use low order bit of skb->_skb_dst to tell dst is not refcounted. Change _skb_dst to _skb_refdst to make sure all uses are catched. skb_dst() returns the dst, regardless of noref bit set or not, but with a lockdep check to make sure a noref dst is not given if current user is not rcu protected. New skb_dst_set_noref() helper to set an notrefcounted dst on a skb. (with lockdep check) skb_dst_drop() drops a reference only if skb dst was refcounted. skb_dst_force() helper is used to force a refcount on dst, when skb is queued and not anymore RCU protected. Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in sock_queue_rcv_skb(), in __nf_queue(). Use skb_dst_force() in dev_requeue_skb(). Note: dst_use_noref() still dirties dst, we might transform it later to do one dirtying per jiffies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-18rps: avoid one atomic in enqueue_to_backlogEric Dumazet
If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic test_and_set_bit() from napi_schedule_prep(). We now have same number of atomic ops per netif_rx() calls than with pre-RPS kernel. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: include/linux/if_link.h
2010-05-16rtnetlink: make SR-IOV VF interface symmetricChris Wright
Now we have a set of nested attributes: IFLA_VFINFO_LIST (NESTED) IFLA_VF_INFO (NESTED) IFLA_VF_MAC IFLA_VF_VLAN IFLA_VF_TX_RATE This allows a single set to operate on multiple attributes if desired. Among other things, it means a dump can be replayed to set state. The current interface has yet to be released, so this seems like something to consider for 2.6.34. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16net: Introduce sk_route_nocapsEric Dumazet
TCP-MD5 sessions have intermittent failures, when route cache is invalidated. ip_queue_xmit() has to find a new route, calls sk_setup_caps(sk, &rt->u.dst), destroying the sk->sk_route_caps &= ~NETIF_F_GSO_MASK that MD5 desperately try to make all over its way (from tcp_transmit_skb() for example) So we send few bad packets, and everything is fine when tcp_transmit_skb() is called again for this socket. Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a socket field, sk_route_nocaps, containing bits to mask on sk_route_caps. Reported-by: Bhaskar Dutta <bhaskie@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16net: Consistent skb timestampingEric Dumazet
With RPS inclusion, skb timestamping is not consistent in RX path. If netif_receive_skb() is used, its deferred after RPS dispatch. If netif_rx() is used, its done before RPS dispatch. This can give strange tcpdump timestamps results. I think timestamping should be done as soon as possible in the receive path, to get meaningful values (ie timestamps taken at the time packet was delivered by NIC driver to our stack), even if NAPI already can defer timestamping a bit (RPS can help to reduce the gap) Tom Herbert prefer to sample timestamps after RPS dispatch. In case sampling is expensive (HPET/acpi_pm on x86), this makes sense. Let admins switch from one mode to another, using a new sysctl, /proc/sys/net/core/netdev_tstamp_prequeue Its default value (1), means timestamps are taken as soon as possible, before backlog queueing, giving accurate timestamps. Setting a 0 value permits to sample timestamps when processing backlog, after RPS dispatch, to lower the load of the pre-RPS cpu. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16net: adjust handle_macvlan to pass port struct to hookJiri Pirko
Now there's null check here and also again in the hook. Looking at bridge bits which are simmilar, port structure is rcu_dereferenced right away in handle_bridge and passed to hook. Looks nicer. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-12Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/wireless/ath/ar9170/usb.c drivers/scsi/iscsi_tcp.c net/ipv4/ipmr.c
2010-05-07rps: Various optimizationsEric Dumazet
Introduce ____napi_schedule() helper for callers in irq disabled contexts. rps_trigger_softirq() becomes a leaf function. Use container_of() in process_backlog() instead of accessing per_cpu address. Use a custom inlined version of __napi_complete() in process_backlog() to avoid one locked instruction : only current cpu owns and manipulates this napi, and NAPI_STATE_SCHED is the only possible flag set on backlog. we can use a plain write instead of clear_bit(), and we dont need an smp_mb() memory barrier, since RPS is on, backlog is protected by a spinlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06veth: Dont kfree_skb() after dev_forward_skb()Eric Dumazet
In case of congestion, netif_rx() frees the skb, so we must assume dev_forward_skb() also consume skb. Bug introduced by commit 445409602c092 (veth: move loopback logic to common location) We must change dev_forward_skb() to always consume skb, and veth to not double free it. Bug report : http://marc.info/?l=linux-netdev&m=127310770900442&w=3 Reported-by: Martín Ferrari <martin.ferrari@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-06netpoll: add generic support for bridge and bonding devicesWANG Cong
This whole patchset is for adding netpoll support to bridge and bonding devices. I already tested it for bridge, bonding, bridge over bonding, and bonding over bridge. It looks fine now. To make bridge and bonding support netpoll, we need to adjust some netpoll generic code. This patch does the following things: 1) introduce two new priv_flags for struct net_device: IFF_IN_NETPOLL which identifies we are processing a netpoll; IFF_DISABLE_NETPOLL is used to disable netpoll support for a device at run-time; 2) introduce one new method for netdev_ops: ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is removed. 3) introduce netpoll_poll_dev() which takes a struct net_device * parameter; export netpoll_send_skb() and netpoll_poll_dev() which will be used later; 4) hide a pointer to struct netpoll in struct netpoll_info, ditto. 5) introduce ->real_dev for struct netpoll. 6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable netconsole before releasing a slave, to avoid deadlocks. Cc: David Miller <davem@davemloft.net> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-05net: __alloc_skb() speedupEric Dumazet
With following patch I can reach maximum rate of my pktgen+udpsink simulator : - 'old' machine : dual quad core E5450 @3.00GHz - 64 UDP rx flows (only differ by destination port) - RPS enabled, NIC interrupts serviced on cpu0 - rps dispatched on 7 other cores. (~130.000 IPI per second) - SLAB allocator (faster than SLUB in this workload) - tg3 NIC - 1.080.000 pps without a single drop at NIC level. Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch first sk_buff cache line, the second to prefetch the shinfo part. Also using one memset() to initialize all skb_shared_info fields instead of one by one to reduce number of instructions, using long word moves. All skb_shared_info fields before 'dataref' are cleared in __alloc_skb(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-04net: skb_free_datagram_locked() fixEric Dumazet
Commit 4b0b72f7dd617b ( net: speedup udp receive path ) introduced a bug in skb_free_datagram_locked(). We should not skb_orphan() skb if we dont have the guarantee we are the last skb user, this might happen with MSG_PEEK concurrent users. To keep socket locked for the smallest period of time, we split consume_skb() logic, inlined in skb_free_datagram_locked() Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-03net: fix softnet_statChangli Gao
Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context without any protection. And enqueue_to_backlog should update the netdev_rx_stat of the target CPU. This patch renames softnet_data.total to softnet_data.processed: the number of packets processed in uppper levels(IP stacks). softnet_stat data is moved into softnet_data. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/linux/netdevice.h | 17 +++++++---------- net/core/dev.c | 26 ++++++++++++-------------- net/sched/sch_generic.c | 2 +- 3 files changed, 20 insertions(+), 25 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-02net: Inline skb_pull() in eth_type_trans().David S. Miller
In commit 6be8ac2f ("[NET]: uninline skb_pull, de-bloats a lot") we uninlined skb_pull. But in some critical paths it makes sense to inline this thing and it helps performance significantly. Create an skb_pull_inline() so that we can do this in a way that serves also as annotation. Based upon a patch by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-01net: sock_def_readable() and friends RCU conversionEric Dumazet
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-28net: speedup udp receive pathEric Dumazet
Since commit 95766fff ([UDP]: Add memory accounting.), each received packet needs one extra sock_lock()/sock_release() pair. This added latency because of possible backlog handling. Then later, ticket spinlocks added yet another latency source in case of DDOS. This patch introduces lock_sock_bh() and unlock_sock_bh() synchronization primitives, avoiding one atomic operation and backlog processing. skb_free_datagram_locked() uses them instead of full blown lock_sock()/release_sock(). skb is orphaned inside locked section for proper socket memory reclaim, and finally freed outside of it. UDP receive path now take the socket spinlock only once. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: disallow to use net_assign_generic externallyJiri Pirko
Now there's no need to use this fuction directly because it's handled by register_pernet_device. So to make this simple and easy to understand, make this static to do not tempt potentional users. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: sk_add_backlog() take rmem_alloc into accountEric Dumazet
Current socket backlog limit is not enough to really stop DDOS attacks, because user thread spend many time to process a full backlog each round, and user might crazy spin on socket lock. We should add backlog size and receive_queue size (aka rmem_alloc) to pace writers, and let user run without being slow down too much. Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in stress situations. Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp receiver can now process ~200.000 pps (instead of ~100 pps before the patch) on a 8 core machine. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: batch skb dequeueing from softnet input_pkt_queueChangli Gao
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock contention when RPS is enabled. Note: in the worst case, the number of packets in a softnet_data may be double of netdev_max_backlog. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27net: reimplement softnet_data.output_queue as a FIFO queueChangli Gao
reimplement softnet_data.output_queue as a FIFO queue to keep the fairness among the qdiscs rescheduled. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ---- include/linux/netdevice.h | 1 + net/core/dev.c | 22 ++++++++++++---------- 2 files changed, 13 insertions(+), 10 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6
2010-04-27Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/e100.c drivers/net/e1000e/netdev.c
2010-04-26net: rtnetlink: decouple rtnetlink address families from real address familiesPatrick McHardy
Decouple rtnetlink address families from real address families in socket.h to be able to add rtnetlink interfaces to code that is not a real address family without increasing AF_MAX/NPROTO. This will be used to add support for multicast route dumping from all tables as the proc interface can't be extended to support anything but the main table without breaking compatibility. This partialy undoes the patch to introduce independant families for routing rules and converts ipmr routing rules to a new rtnetlink family. Similar to that patch, values up to 127 are reserved for real address families, values above that may be used arbitrarily. Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-26net: fib_rules: mark arguments to fib_rules_register const and __net_initdataPatrick McHardy
fib_rules_register() duplicates the template passed to it without modification, mark the argument as const. Additionally the templates are only needed when instantiating a new namespace, so mark them as __net_initdata, which means they can be discarded when CONFIG_NET_NS=n. Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-04-25netns: rename unregister_pernet_subsys parameterJiri Pirko
Stay consistent with other functions and with comment also and name pernet_operations parameter properly. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-25rps: optimize rps_get_cpu()Changli Gao
optimize rps_get_cpu(). don't initialize ports when we can get the ports. one memory access for ports than two. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22net: Socket filter ancilliary data access for skb->dev->typePaul LeoNerd Evans
Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving access to skb->dev->type, as reported in the sll_hatype field. When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all interfaces, there doesn't appear to be a way for the filter program to actually find out the underlying hardware type the packet was captured on. This patch adds such ability. This patch also handles the case where skb->dev can be NULL, such as on netlink sockets. Signed-off-by: Paul Evans <leonerd@leonerd.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22rtnetlink: potential ERR_PTR dereferenceDan Carpenter
In the original code, if rtnl_create_link() returned an ERR_PTR then that would get passed to rtnl_configure_link() which dereferences it. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22net: Orphan and de-dst skbs earlier in xmit path.David S. Miller
This way GSO packets don't get handled differently. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-22rps: immediate send IPI in process_backlog()Eric Dumazet
If some skb are queued to our backlog, we are delaying IPI sending at the end of net_rx_action(), increasing latencies. This defeats the queueing, since we want to quickly dispatch packets to the pool of worker cpus, then eventually deeply process our packets. It's better to send IPI before processing our packets in upper layers, from process_backlog(). Change the _and_disable_irq suffix to _and_enable_irq(), since we enable local irq in net_rps_action(), sorry for the confusion. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22net: Introduce skb_orphan_try()Eric Dumazet
At this point, skb->destructor is not the original one (stored in DEV_GSO_CB(skb)->destructor) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-6000.c net/core/dev.c
2010-04-21net: Fix an RCU warning in dev_pick_tx()David Howells
Fix the following RCU warning in dev_pick_tx(): =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- net/core/dev.c:1993 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by swapper/0: #0: (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278 #1: (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4 Call Trace: <IRQ> [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4 [<ffffffff81350c34>] ip6_output2+0x256/0x261 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d [<ffffffff81368acb>] mld_sendpack+0x273/0x39d [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d [<ffffffff81052099>] ? mark_held_locks+0x52/0x70 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288 [<ffffffff81035531>] ? __do_softirq+0x69/0x140 [<ffffffff8103556a>] __do_softirq+0xa2/0x140 [<ffffffff81002e0c>] call_softirq+0x1c/0x28 [<ffffffff81004b54>] do_softirq+0x38/0x80 [<ffffffff81034f06>] irq_exit+0x45/0x47 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78 [<ffffffff810011cb>] cpu_idle+0x4d/0x83 [<ffffffff81380b05>] rest_init+0xb9/0xc0 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb An rcu_dereference() should be an rcu_dereference_bh(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21net: Remove two unnecessary exports (skbuff).Rami Rosen
There is no need to export skb_under_panic() and skb_over_panic() in skbuff.c, since these methods are used only in skbuff.c ; this patch removes these two exports. It also marks these functions as 'static' and removeS the extern declarations of them from include/linux/skbuff.h Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: sk_sleep() helperEric Dumazet
Define a new function to return the waitqueue of a "struct sock". static inline wait_queue_head_t *sk_sleep(struct sock *sk) { return sk->sk_sleep; } Change all read occurrences of sk_sleep by a call to this function. Needed for a future RCU conversion. sk_sleep wont be a field directly available. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>