summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2006-06-26[NET] netpoll: break recursive loop in netpoll rx pathNeil Horman
The netpoll system currently has a rx to tx path via: netpoll_rx __netpoll_rx arp_reply netpoll_send_skb dev->hard_start_tx This rx->tx loop places network drivers at risk of inadvertently causing a deadlock or BUG halt by recursively trying to acquire a spinlock that is used in both their rx and tx paths (this problem was origionally reported to me in the 3c59x driver, which shares a spinlock between the boomerang_interrupt and boomerang_start_xmit routines). This patch breaks this loop, by queueing arp frames, so that they can be responded to after all receive operations have been completed. Tested by myself and the reported with successful results. Specifically it was tested with netdump. Heres the BZ with details: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=194055 Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-26[NET] netpoll: don't spin forever sending to stopped queuesJeremy Fitzhardinge
When transmitting a skb in netpoll_send_skb(), only retry a limited number of times if the device queue is stopped. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-26[NET]: skb_find_text ignores to argumentPhil Oester
skb_find_text takes a "to" argument which is supposed to limit how far into the skb it will search for the given text. At present, it seems to ignore that argument on the first skb, and instead return a match even if the text occurs beyond the limit. Patch below fixes this, after adjusting for the "from" starting point. This consequently fixes the netfilter string match's "--to" handling, which currently is broken. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-26[NET]: make net/core/dev.c:netdev_nit staticAdrian Bunk
netdev_nit can now become static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-26[NET]: Fix GSO problems in dev_hard_start_xmit()Michael Chan
Fix 2 problems in dev_hard_start_xmit(): 1. nskb->next needs to link back to skb->next if hard_start_xmit() returns non-zero. 2. Since the total number of GSO fragments may exceed MAX_SKB_FRAGS + 1, it needs to stop transmitting if the netif_queue is stopped. Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [NET]: Require CAP_NET_ADMIN to create tuntap devices. [NET]: fix net-core kernel-doc [TCP]: Move inclusion of <linux/dmaengine.h> to correct place in <linux/tcp.h> [IPSEC]: Handle GSO packets [NET]: Added GSO toggle [NET]: Add software TSOv4 [NET]: Add generic segmentation offload [NET]: Merge TSO/UFO fields in sk_buff [NET]: Prevent transmission after dev_deactivate [IPV6] ADDRCONF: Fix default source address selection without CONFIG_IPV6_PRIVACY [IPV6]: Fix source address selection. [NET]: Avoid allocating skb in skb_pad
2006-06-23[PATCH] list: use list_replace_init() instead of list_splice_init()Oleg Nesterov
list_splice_init(list, head) does unneeded job if it is known that list_empty(head) == 1. We can use list_replace_init() instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[NET]: fix net-core kernel-docRandy Dunlap
Warning(/var/linsrc/linux-2617-g4//include/linux/skbuff.h:304): No description found for parameter 'dma_cookie' Warning(/var/linsrc/linux-2617-g4//include/net/sock.h:1274): No description found for parameter 'copied_early' Warning(/var/linsrc/linux-2617-g4//net/core/dev.c:3309): No description found for parameter 'chan' Warning(/var/linsrc/linux-2617-g4//net/core/dev.c:3309): No description found for parameter 'event' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Added GSO toggleHerbert Xu
This patch adds a generic segmentation offload toggle that can be turned on/off for each net device. For now it only supports in TCPv4. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Add software TSOv4Herbert Xu
This patch adds the GSO implementation for IPv4 TCP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Add generic segmentation offloadHerbert Xu
This patch adds the infrastructure for generic segmentation offload. The idea is to tap into the potential savings of TSO without hardware support by postponing the allocation of segmented skb's until just before the entry point into the NIC driver. The same structure can be used to support software IPv6 TSO, as well as UFO and segmentation offload for other relevant protocols, e.g., DCCP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Merge TSO/UFO fields in sk_buffHerbert Xu
Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Prevent transmission after dev_deactivateHerbert Xu
The dev_deactivate function has bit-rotted since the introduction of lockless drivers. In particular, the spin_unlock_wait call at the end has no effect on the xmit routine of lockless drivers. With a little bit of work, we can make it much more useful by providing the guarantee that when it returns, no more calls to the xmit routine of the underlying driver will be made. The idea is simple. There are two entry points in to the xmit routine. The first comes from dev_queue_xmit. That one is easily stopped by using synchronize_rcu. This works because we set the qdisc to noop_qdisc before the synchronize_rcu call. That in turn causes all subsequent packets sent to dev_queue_xmit to be dropped. The synchronize_rcu call also ensures all outstanding calls leave their critical section. The other entry point is from qdisc_run. Since we now have a bit that indicates whether it's running, all we have to do is to wait until the bit is off. I've removed the loop to wait for __LINK_STATE_SCHED to clear. This is useless because netif_wake_queue can cause it to be set again. It is also harmless because we've disarmed qdisc_run. I've also removed the spin_unlock_wait on xmit_lock because its only purpose of making sure that all outstanding xmit_lock holders have exited is also given by dev_watchdog_down. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-23[NET]: Avoid allocating skb in skb_padHerbert Xu
First of all it is unnecessary to allocate a new skb in skb_pad since the existing one is not shared. More importantly, our hard_start_xmit interface does not allow a new skb to be allocated since that breaks requeueing. This patch uses pskb_expand_head to expand the existing skb and linearize it if needed. Actually, someone should sift through every instance of skb_pad on a non-linear skb as they do not fit the reasons why this was originally created. Incidentally, this fixes a minor bug when the skb is cloned (tcpdump, TCP, etc.). As it is skb_pad will simply write over a cloned skb. Because of the position of the write it is unlikely to cause problems but still it's best if we don't do it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[ETHTOOL]: Fix UFO typoHerbert Xu
The function ethtool_get_ufo was referring to ETHTOOL_GTSO instead of ETHTOOL_GUFO. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUMHerbert Xu
The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM identically so we test for them in quite a few places. For the sake of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We also test the disjunct of NETIF_F_IP_CSUM and the other two in various places, for that purpose I've added NETIF_F_ALL_CSUM. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[NET]: Warn in __skb_trim if skb is pagedHerbert Xu
It's better to warn and fail rather than rarely triggering BUG on paths that incorrectly call skb_trim/__skb_trim on a non-linear skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[NET]: Clean up skb_linearizeHerbert Xu
The linearisation operation doesn't need to be super-optimised. So we can replace __skb_linearize with __pskb_pull_tail which does the same thing but is more general. Also, most users of skb_linearize end up testing whether the skb is linear or not so it helps to make skb_linearize do just that. Some callers of skb_linearize also use it to copy cloned data, so it's useful to have a new function skb_linearize_cow to copy the data if it's either non-linear or cloned. Last but not least, I've removed the gfp argument since nobody uses it anymore. If it's ever needed we can easily add it back. Misc bugs fixed by this patch: * via-velocity error handling (also, no SG => no frags) Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[NET]: Add netif_tx_lockHerbert Xu
Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[SECMARK]: Add secmark support to core networking.James Morris
Add a secmark field to the skbuff structure, to allow security subsystems to place security markings on network packets. This is similar to the nfmark field, except is intended for implementing security policy, rather than than networking policy. This patch was already acked in principle by Dave Miller. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[I/OAT]: Add a sysctl for tuning the I/OAT offloaded I/O thresholdChris Leech
Any socket recv of less than this ammount will not be offloaded Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[I/OAT]: Structure changes for TCP recv offload to I/OATChris Leech
Adds an async_wait_queue and some additional fields to tcp_sock, and a dma_cookie_t to sk_buff. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[I/OAT]: Utility functions for offloading sk_buff to iovec copiesChris Leech
Provides for pinning user space pages in memory, copying to iovecs, and copying from sk_buffs including fragmented and chained sk_buffs. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-18[I/OAT]: Setup the networking subsystem as a DMA clientChris Leech
Attempts to allocate per-CPU DMA channels Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-26[NET]: dev.c comment fixesStephen Hemminger
Noticed that dev_alloc_name() comment was incorrect, and more spellung errors. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-12[NEIGH]: Fix IP-over-ATM and ARP interaction.Simon Kelley
The classical IP over ATM code maintains its own IPv4 <-> <ATM stuff> ARP table, using the standard neighbour-table code. The neigh_table_init function adds this neighbour table to a linked list of all neighbor tables which is used by the functions neigh_delete() neigh_add() and neightbl_set(), all called by the netlink code. Once the ATM neighbour table is added to the list, there are two tables with family == AF_INET there, and ARP entries sent via netlink go into the first table with matching family. This is indeterminate and often wrong. To see the bug, on a kernel with CLIP enabled, create a standard IPv4 ARP entry by pinging an unused address on a local subnet. Then attempt to complete that entry by doing ip neigh replace <ip address> lladdr <some mac address> nud reachable Looking at the ARP tables by using ip neigh show will reveal two ARP entries for the same address. One of these can be found in /proc/net/arp, and the other in /proc/net/atm/arp. This patch adds a new function, neigh_table_init_no_netlink() which does everything the neigh_table_init() does, except add the table to the netlink all-arp-tables chain. In addition neigh_table_init() has a check that all tables on the chain have a distinct address family. The init call in clip.c is changed to call neigh_table_init_no_netlink(). Since ATM ARP tables are rather more complicated than can currently be handled by the available rtattrs in the netlink protocol, no functionality is lost by this patch, and non-ATM ARP manipulation via netlink is rescued. A more complete solution would involve a rtattr for ATM ARP entries and some way for the netlink code to give neigh_add and friends more information than just address family with which to find the correct ARP table. [ I've changed the assertion checking in neigh_table_init() to not use BUG_ON() while holding neigh_tbl_lock. Instead we remember that we found an existing tbl with the same family, and after dropping the lock we'll give a diagnostic kernel log message and a stack dump. -DaveM ] Signed-off-by: Simon Kelley <simon@thekelleys.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-10[NET]: Do sysfs registration as part of register_netdevice.Stephen Hemminger
The last step of netdevice registration was being done by a delayed call, but because it was delayed, it was impossible to return any error code if the class_device registration failed. Side effects: * one state in registration process is unnecessary. * register_netdevice can sleep inside class_device registration/hotplug * code in netdev_run_todo only does unregistration so it is simpler. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-09[NET] linkwatch: Handle jiffies wrap-aroundHerbert Xu
The test used in the linkwatch does not handle wrap-arounds correctly. Since the intention of the code is to eliminate bursts of messages we can afford to delay things up to a second. Using that fact we can easily handle wrap-arounds by making sure that we don't delay things by more than one second. This is based on diagnosis and a patch by Stefan Rompf. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Stefan Rompf <stefan@loplof.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-09[NET]: Make netdev_chain a raw notifier.Alan Stern
From: Alan Stern <stern@rowland.harvard.edu> This chain does it's own locking via the RTNL semaphore, and can also run recursively so adding a new mutex here was causing deadlocks. Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-07[NET]: Create netdev attribute_groups with class_device_addStephen Hemminger
Atomically create attributes when class device is added. This avoids the race between registering class_device (which generates hotplug event), and the creation of attribute groups. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-20Merge branch 'upstream-linus' of ↵Linus Torvalds
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6 * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: (21 commits) [PATCH] wext: Fix RtNetlink ENCODE security permissions [PATCH] bcm43xx: iw_priv_args names should be <16 characters [PATCH] bcm43xx: sysfs code cleanup [PATCH] bcm43xx: fix pctl slowclock limit calculation [PATCH] bcm43xx: fix dyn tssi2dbm memleak [PATCH] bcm43xx: fix config menu alignment [PATCH] bcm43xx wireless: fix printk format warnings [PATCH] softmac: report when scanning has finished [PATCH] softmac: fix event sending [PATCH] softmac: handle iw_mode properly [PATCH] softmac: dont send out packets while scanning [PATCH] softmac: return -EAGAIN from getscan while scanning [PATCH] bcm43xx: set trans_start on TX to prevent bogus timeouts [PATCH] orinoco: fix truncating commsquality RID with the latest Symbol firmware [PATCH] softmac: fix spinlock recursion on reassoc [PATCH] Revert NET_RADIO Kconfig title change [PATCH] wext: Fix IWENCODEEXT security permissions [PATCH] wireless/atmel: send WEXT scan completion events [PATCH] wireless/airo: clean up WEXT association and scan events [PATCH] softmac uses Wiress Ext. ...
2006-04-20[NET]: Add skb->truesize assertion checking.David S. Miller
Add some sanity checking. truesize should be at least sizeof(struct sk_buff) plus the current packet length. If not, then truesize is seriously mangled and deserves a kernel log message. Currently we'll do the check for release of stream socket buffers. But we can add checks to more spots over time. Incorporating ideas from Herbert Xu. Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-19[PATCH] wext: Fix RtNetlink ENCODE security permissionsJean Tourrilhes
I've just realised that the RtNetlink code does not check the permission for SIOCGIWENCODE and SIOCGIWENCODEEXT, which means that any user can read the encryption keys. The fix is trivial and should go in 2.6.17 alonside the two other patch I sent you last week. Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2006-04-19[PATCH] wext: Fix IWENCODEEXT security permissionsJean Tourrilhes
Check the permissions when user-space try to read the encryption parameters via SIOCGIWENCODEEXT. This is trivial and probably should go in 2.6.17... Bug was found by Brian Eaton <eaton.lists@gmail.com>, thanks ! Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2006-04-18unaligned access in sk_run_filter()Dmitry Mishin
This patch fixes unaligned access warnings noticed on IA64 in sk_run_filter(). 'ptr' can be unaligned. Signed-off-By: Dmitry Mishin <dim@openvz.org> Signed-off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-11[PATCH] for_each_possible_cpu: network codesKAMEZAWA Hiroyuki
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes in the past where people were using for_each_cpu() where they should have been iterating across only online or present CPUs. This is inefficient and possibly buggy. We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the future. This patch replaces for_each_cpu with for_each_possible_cpu under /net Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-10[NET]: Fix hotplug race during device registration.Sergey Vlasov
From: Thomas de Grenier de Latour <degrenier@easyconnect.fr> On Sun, 9 Apr 2006 21:56:59 +0400, Sergey Vlasov <vsu@altlinux.ru> wrote: > However, show_address() does not output anything unless > dev->reg_state == NETREG_REGISTERED - and this state is set by > netdev_run_todo() only after netdev_register_sysfs() returns, so in > the meantime (while netdev_register_sysfs() is busy adding the > "statistics" attribute group) some process may see an empty "address" > attribute. I've tried the attached patch, suggested by Sergey Vlasov on hotplug-devel@, and as far as i can test it works just fine. Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-10[NET]: More kzalloc conversions.Andrew Morton
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-10[NET] kzalloc: use in alloc_netdevPaolo 'Blaisorblade' Giarrusso
Noticed this use, fixed it. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-10[NET]: Fix an off-by-21-or-49 error.Adrian Bunk
This patch fixes an off-by-21-or-49 error ;-) spotted by the Coverity checker. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-31[NET]: add SO_RCVBUF commentAndrew Morton
Put a comment in there explaining why we double the setsockopt() caller's SO_RCVBUF. People keep wondering. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-29[NET]: Deinline some larger functions from netdevice.hDenis Vlasenko
On a allyesconfig'ured kernel: Size Uses Wasted Name and definition ===== ==== ====== ================================================ 95 162 12075 netif_wake_queue include/linux/netdevice.h 129 86 9265 dev_kfree_skb_any include/linux/netdevice.h 127 56 5885 netif_device_attach include/linux/netdevice.h 73 86 4505 dev_kfree_skb_irq include/linux/netdevice.h 46 60 1534 netif_device_detach include/linux/netdevice.h 119 16 1485 __netif_rx_schedule include/linux/netdevice.h 143 5 492 netif_rx_schedule include/linux/netdevice.h 81 7 366 netif_schedule include/linux/netdevice.h netif_wake_queue is big because __netif_schedule is a big inline: static inline void __netif_schedule(struct net_device *dev) { if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) { unsigned long flags; struct softnet_data *sd; local_irq_save(flags); sd = &__get_cpu_var(softnet_data); dev->next_sched = sd->output_queue; sd->output_queue = dev; raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_restore(flags); } } static inline void netif_wake_queue(struct net_device *dev) { #ifdef CONFIG_NETPOLL_TRAP if (netpoll_trap()) return; #endif if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state)) __netif_schedule(dev); } By de-inlining __netif_schedule we are saving a lot of text at each callsite of netif_wake_queue and netif_schedule. __netif_rx_schedule is also big, and it makes more sense to keep both of them out of line. Patch also deinlines dev_kfree_skb_any. We can deinline dev_kfree_skb_irq instead... oh well. netif_device_attach/detach are not hot paths, we can deinline them too. Signed-off-by: Denis Vlasenko <vda@ilport.com.ua> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-29[NET]: deinline 200+ byte inlines in sock.hDenis Vlasenko
Sizes in bytes (allyesconfig, i386) and files where those inlines are used: 238 sock_queue_rcv_skb 2.6.16/net/x25/x25_in.o 238 sock_queue_rcv_skb 2.6.16/net/rose/rose_in.o 238 sock_queue_rcv_skb 2.6.16/net/packet/af_packet.o 238 sock_queue_rcv_skb 2.6.16/net/netrom/nr_in.o 238 sock_queue_rcv_skb 2.6.16/net/llc/llc_sap.o 238 sock_queue_rcv_skb 2.6.16/net/llc/llc_conn.o 238 sock_queue_rcv_skb 2.6.16/net/irda/af_irda.o 238 sock_queue_rcv_skb 2.6.16/net/ipx/af_ipx.o 238 sock_queue_rcv_skb 2.6.16/net/ipv6/udp.o 238 sock_queue_rcv_skb 2.6.16/net/ipv6/raw.o 238 sock_queue_rcv_skb 2.6.16/net/ipv4/udp.o 238 sock_queue_rcv_skb 2.6.16/net/ipv4/raw.o 238 sock_queue_rcv_skb 2.6.16/net/ipv4/ipmr.o 238 sock_queue_rcv_skb 2.6.16/net/econet/econet.o 238 sock_queue_rcv_skb 2.6.16/net/econet/af_econet.o 238 sock_queue_rcv_skb 2.6.16/net/bluetooth/sco.o 238 sock_queue_rcv_skb 2.6.16/net/bluetooth/l2cap.o 238 sock_queue_rcv_skb 2.6.16/net/bluetooth/hci_sock.o 238 sock_queue_rcv_skb 2.6.16/net/ax25/ax25_in.o 238 sock_queue_rcv_skb 2.6.16/net/ax25/af_ax25.o 238 sock_queue_rcv_skb 2.6.16/net/appletalk/ddp.o 238 sock_queue_rcv_skb 2.6.16/drivers/net/pppoe.o 276 sk_receive_skb 2.6.16/net/decnet/dn_nsp_in.o 276 sk_receive_skb 2.6.16/net/dccp/ipv6.o 276 sk_receive_skb 2.6.16/net/dccp/ipv4.o 276 sk_receive_skb 2.6.16/net/dccp/dccp_ipv6.o 276 sk_receive_skb 2.6.16/drivers/net/pppoe.o 209 sk_dst_check 2.6.16/net/ipv6/ip6_output.o 209 sk_dst_check 2.6.16/net/ipv4/udp.o 209 sk_dst_check 2.6.16/net/decnet/dn_nsp_out.o Large inlines with multiple callers: Size Uses Wasted Name and definition ===== ==== ====== ================================================ 238 21 4360 sock_queue_rcv_skb include/net/sock.h 109 10 801 sock_recv_timestamp include/net/sock.h 276 4 768 sk_receive_skb include/net/sock.h 94 8 518 __sk_dst_check include/net/sock.h 209 3 378 sk_dst_check include/net/sock.h 131 4 333 sk_setup_caps include/net/sock.h 152 2 132 sk_stream_alloc_pskb include/net/sock.h 125 2 105 sk_stream_writequeue_purge include/net/sock.h Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-27Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [NET]: drop duplicate assignment in request_sock [IPSEC]: Fix tunnel error handling in ipcomp6
2006-03-27[PATCH] Notifier chain update: API changesAlan Stern
The kernel's implementation of notifier chains is unsafe. There is no protection against entries being added to or removed from a chain while the chain is in use. The issues were discussed in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2 We noticed that notifier chains in the kernel fall into two basic usage classes: "Blocking" chains are always called from a process context and the callout routines are allowed to sleep; "Atomic" chains can be called from an atomic context and the callout routines are not allowed to sleep. We decided to codify this distinction and make it part of the API. Therefore this set of patches introduces three new, parallel APIs: one for blocking notifiers, one for atomic notifiers, and one for "raw" notifiers (which is really just the old API under a new name). New kinds of data structures are used for the heads of the chains, and new routines are defined for registration, unregistration, and calling a chain. The three APIs are explained in include/linux/notifier.h and their implementation is in kernel/sys.c. With atomic and blocking chains, the implementation guarantees that the chain links will not be corrupted and that chain callers will not get messed up by entries being added or removed. For raw chains the implementation provides no guarantees at all; users of this API must provide their own protections. (The idea was that situations may come up where the assumptions of the atomic and blocking APIs are not appropriate, so it should be possible for users to handle these things in their own way.) There are some limitations, which should not be too hard to live with. For atomic/blocking chains, registration and unregistration must always be done in a process context since the chain is protected by a mutex/rwsem. Also, a callout routine for a non-raw chain must not try to register or unregister entries on its own chain. (This did happen in a couple of places and the code had to be changed to avoid it.) Since atomic chains may be called from within an NMI handler, they cannot use spinlocks for synchronization. Instead we use RCU. The overhead falls almost entirely in the unregister routine, which is okay since unregistration is much less frequent that calling a chain. Here is the list of chains that we adjusted and their classifications. None of them use the raw API, so for the moment it is only a placeholder. ATOMIC CHAINS ------------- arch/i386/kernel/traps.c: i386die_chain arch/ia64/kernel/traps.c: ia64die_chain arch/powerpc/kernel/traps.c: powerpc_die_chain arch/sparc64/kernel/traps.c: sparc64die_chain arch/x86_64/kernel/traps.c: die_chain drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list kernel/panic.c: panic_notifier_list kernel/profile.c: task_free_notifier net/bluetooth/hci_core.c: hci_notifier net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain net/ipv6/addrconf.c: inet6addr_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain net/netlink/af_netlink.c: netlink_chain BLOCKING CHAINS --------------- arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain arch/s390/kernel/process.c: idle_chain arch/x86_64/kernel/process.c idle_notifier drivers/base/memory.c: memory_chain drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list drivers/macintosh/adb.c: adb_client_list drivers/macintosh/via-pmu.c sleep_notifier_list drivers/macintosh/via-pmu68k.c sleep_notifier_list drivers/macintosh/windfarm_core.c wf_client_list drivers/usb/core/notify.c usb_notifier_list drivers/video/fbmem.c fb_notifier_list kernel/cpu.c cpu_chain kernel/module.c module_notify_list kernel/profile.c munmap_notifier kernel/profile.c task_exit_notifier kernel/sys.c reboot_notifier_list net/core/dev.c netdev_chain net/decnet/dn_dev.c: dnaddr_chain net/ipv4/devinet.c: inetaddr_chain It's possible that some of these classifications are wrong. If they are, please let us know or submit a patch to fix them. Note that any chain that gets called very frequently should be atomic, because the rwsem read-locking used for blocking chains is very likely to incur cache misses on SMP systems. (However, if the chain's callout routines may sleep then the chain cannot be atomic.) The patch set was written by Alan Stern and Chandra Seetharaman, incorporating material written by Keith Owens and suggestions from Paul McKenney and Andrew Morton. [jes@sgi.com: restructure the notifier chain initialization macros] Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[NET]: drop duplicate assignment in request_sockNorbert Kiesel
Just noticed that request_sock.[ch] contain a useless assignment of rskq_accept_head to itself. I assume this is a typo and the 2nd one was supposed to be _tail. However, setting _tail to NULL is not needed, so the patch below just drops the 2nd assignment. Signed-off-By: Norbert Kiesel <nkiesel@tbdnetworks.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-25Merge branch 'audit.b3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits) [PATCH] fix audit_init failure path [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format [PATCH] sem2mutex: audit_netlink_sem [PATCH] simplify audit_free() locking [PATCH] Fix audit operators [PATCH] promiscuous mode [PATCH] Add tty to syscall audit records [PATCH] add/remove rule update [PATCH] audit string fields interface + consumer [PATCH] SE Linux audit events [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL [PATCH] Fix IA64 success/failure indication in syscall auditing. [PATCH] Miscellaneous bug and warning fixes [PATCH] Capture selinux subject/object context information. [PATCH] Exclude messages by message type [PATCH] Collect more inode information during syscall processing. [PATCH] Pass dentry, not just name, in fsnotify creation hooks. [PATCH] Define new range of userspace messages. [PATCH] Filter rule comparators ... Fixed trivial conflict in security/selinux/hooks.c
2006-03-25Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [NETFILTER] x_table.c: sem2mutex [IPV4]: Aggregate route entries with different TOS values [TCP]: Mark tcp_*mem[] __read_mostly. [TCP]: Set default max buffers from memory pool size [SCTP]: Fix up sctp_rcv return value [NET]: Take RTNL when unregistering notifier [WIRELESS]: Fix config dependencies. [NET]: Fill in a 32-bit hole in struct sock on 64-bit platforms. [NET]: Ensure device name passed to SO_BINDTODEVICE is NULL terminated. [MODULES]: Don't allow statically declared exports [BRIDGE]: Unaligned accesses in the ethernet bridge
2006-03-25[PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notificationsDavide Libenzi
Implement the half-closed devices notifiation, by adding a new POLLRDHUP (and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the existing POLLHUP handling, that does not report correctly half-closed devices, was feared to be changed, this implementation leaves the current POLLHUP reporting unchanged and simply add a new bit that is set in the few places where it makes sense. The same thing was discussed and conceptually agreed quite some time ago: http://lkml.org/lkml/2003/7/12/116 Since this new event bit is added to the existing Linux poll infrastruture, even the existing poll/select system calls will be able to use it. As far as the existing POLLHUP handling, the patch leaves it as is. The pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing archs and sets the bit in the six relevant files. The other attached diff is the simple change required to sys/epoll.h to add the EPOLLRDHUP definition. There is "a stupid program" to test POLLRDHUP delivery here: http://www.xmailserver.org/pollrdhup-test.c It tests poll(2), but since the delivery is same epoll(2) will work equally. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] slab: implement /proc/slab_allocatorsAl Viro
Implement /proc/slab_allocators. It produces output like: idr_layer_cache: 80 idr_pre_get+0x33/0x4e buffer_head: 2555 alloc_buffer_head+0x20/0x75 mm_struct: 9 mm_alloc+0x1e/0x42 mm_struct: 20 dup_mm+0x36/0x370 vm_area_struct: 384 dup_mm+0x18f/0x370 vm_area_struct: 151 do_mmap_pgoff+0x2e0/0x7c3 vm_area_struct: 1 split_vma+0x5a/0x10e vm_area_struct: 11 do_brk+0x206/0x2e2 vm_area_struct: 2 copy_vma+0xda/0x142 vm_area_struct: 9 setup_arg_pages+0x99/0x214 fs_cache: 8 copy_fs_struct+0x21/0x133 fs_cache: 29 copy_process+0xf38/0x10e3 files_cache: 30 alloc_files+0x1b/0xcf signal_cache: 81 copy_process+0xbaa/0x10e3 sighand_cache: 77 copy_process+0xe65/0x10e3 sighand_cache: 1 de_thread+0x4d/0x5f8 anon_vma: 241 anon_vma_prepare+0xd9/0xf3 size-2048: 1 add_sect_attrs+0x5f/0x145 size-2048: 2 journal_init_revoke+0x99/0x302 size-2048: 2 journal_init_revoke+0x137/0x302 size-2048: 2 journal_init_inode+0xf9/0x1c4 Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> DESC slab-leaks3-locking-fix EDESC From: Andrew Morton <akpm@osdl.org> Update for slab-remove-cachep-spinlock.patch Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexander Nyberg <alexn@telia.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>