summaryrefslogtreecommitdiff
path: root/include/net/route.h
AgeCommit message (Collapse)Author
2011-04-07ipv4: Fix "Set rt->rt_iif more sanely on output routes."OGAWA Hirofumi
Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more sanely on output routes.") breaks rt_is_{output,input}_route. This became the cause to return "IP_PKTINFO's ->ipi_ifindex == 0". To fix it, this does: 1) Add "int rt_route_iif;" to struct rtable 2) For input routes, always set rt_route_iif to same value as rt_iif 3) For output routes, always set rt_route_iif to zero. Set rt_iif as it is done currently. 4) Change rt_is_{output,input}_route() to test rt_route_iif Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-25route: Take the right src and dst addresses in ip_route_newportsSteffen Klassert
When we set up the flow informations in ip_route_newports(), we take the address informations from the the rt_key_src and rt_key_dst fields of the rtable. They appear to be empty. So take the address informations from rt_src and rt_dst instead. This issue was introduced by commit 5e2b61f78411be25f0b84f97d5b5d312f184dfd1 ("ipv4: Remove flowi from struct rtable.") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: fix route deletion for IPs on many subnetsJulian Anastasov
Alex Sidorenko reported for problems with local routes left after IP addresses are deleted. It happens when same IPs are used in more than one subnet for the device. Fix fib_del_ifaddr to restrict the checks for duplicate local and broadcast addresses only to the IFAs that use our primary IFA or another primary IFA with same address. And we expect the prefsrc to be matched when the routes are deleted because it is possible they to differ only by prefsrc. This patch prevents local and broadcast routes to be leaked until their primary IP is deleted finally from the box. As the secondary address promotion needs to delete the routes for all secondaries that used the old primary IFA, add option to ignore these secondaries from the checks and to assume they are already deleted, so that we can safely delete the route while these IFAs are still on the device list. Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Put fl4_* macros to struct flowi4 and use them again.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use flowi4 in public route lookup interfaces.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Make flowi ports AF dependent.David S. Miller
Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Put flowi_* prefix on AF independent members of struct flowiDavid S. Miller
I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Create and use route lookup helpers.David S. Miller
The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-05ipv4: Remove flowi from struct rtable.David S. Miller
The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-05ipv4: Use passed-in protocol in ip_route_newports().David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: ip_route_output_key() is better as an inline.David S. Miller
This avoid a stack frame at zero cost. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02ipv4: Make output route lookup return rtable directly.David S. Miller
Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01xfrm: Handle blackhole route creation via afinfo.David S. Miller
That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Kill can_sleep arg to ip_route_output_flow()David S. Miller
This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01net: Add FLOWI_FLAG_CAN_SLEEP.David S. Miller
And set is in contexts where the route resolution can sleep. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"David S. Miller
Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01ipv4: Can final ip_route_connect() arg to boolean "can_sleep".David S. Miller
Since that's what the current vague "flags" thing means. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24ipv4: Rearrange how ip_route_newports() gets port keys.David S. Miller
ip_route_newports() is the only place in the entire kernel that cares about the port members in the routing cache entry's lookup flow key. Therefore the only reason we store an entire flow inside of the struct rtentry is for this one special case. Rewrite ip_route_newports() such that: 1) The caller passes in the original port values, so we don't need to use the rth->fl.fl_ip_{s,d}port values to remember them. 2) The lookup flow is constructed by hand instead of being copied from the routing cache entry's flow. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10inet: Create a mechanism for upward inetpeer propagation into routes.David S. Miller
If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects. The reason is that if we didn't have a routing cache, then there would be no way to lookup all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc. Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation a new inetpeer entry. If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry. The facilities implemented here handle this problem. First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID. The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too. This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information. At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-28net: Pre-COW metrics for TCP.David S. Miller
TCP is going to record metrics for the connection, so pre-COW the route metrics at route cache entry creation time. This avoids several atomic operations that have to occur if we COW the metrics after the entry reaches global visibility. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27net: Implement read-only protection and COW'ing of metrics.David S. Miller
Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-20ipv4: Flush per-ns routing cache more sanely.David S. Miller
Flush the routing cache only of entries that match the network namespace in which the purge event occurred. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-12-13ipv4: Don't pre-seed hoplimit metric.David S. Miller
Always go through a new ip4_dst_hoplimit() helper, just like ipv6. This allowed several simplifications: 1) The interim dst_metric_hoplimit() can go as it's no longer userd. 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer. 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static. When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too. We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17net: use the macros defined for the members of flowiChangli Gao
Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-12ipv4: Make rt->fl.iif tests lest obscure.David S. Miller
When we test rt->fl.iif against zero, we're seeing if it's an output or an input route. Make that explicit with some helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-11net: get rid of rtable->idevEric Dumazet
It seems idev field in struct rtable has no special purpose, but adding extra atomic ops. We hold refcounts on the device itself (using percpu data, so pretty cheap in current kernel). infiniband case is solved using dst.dev instead of idev->dev Removal of this field means routing without route cache is now using shared data, percpu data, and only potential contention is a pair of atomic ops on struct neighbour per forwarded packet. About 5% speedup on routing test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27tproxy: check for transparent flag in ip_route_newportsUlrich Weber
as done in ip_route_connect() Signed-off-by: Ulrich Weber <uweber@astaro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-11net-next: remove useless union keywordChangli Gao
remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-18net: implements ip_route_input_noref()Eric Dumazet
ip_route_input() is the version returning a refcounted dst, while ip_route_input_noref() returns a non refcounted one. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-17percpu: add __percpu sparse annotations to netTejun Heo
Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_*() users which used to cast the argument to (void **) are updated to cast it to (void __percpu **). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCHEric W. Biederman
The motivation for an additional notifier in batched netdevice notification (rt_do_flush) only needs to be called once per batch not once per namespace. For further batching improvements I need a guarantee that the netdevices are unregistered in order allowing me to unregister an all of the network devices in a network namespace at the same time with the guarantee that the loopback device is really and truly unregistered last. Additionally it appears that we moved the route cache flush after the final synchronize_net, which seems wrong and there was no explanation. So I have restored the original location of the final synchronize_net. Cc: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04net: cleanup include/netEric Dumazet
This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03net: skb->rtable accessorEric Dumazet
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-01ipv4: Conditionally enable transparent flow flag when connectingKOVACS Krisztian
Set FLOWI_FLAG_ANYSRC in flowi->flags if the socket has the transparent socket option set. This way we selectively enable certain connections with non-local source addresses to be routed. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-01ipv4: Make inet_sock.h independent of route.hKOVACS Krisztian
inet_iif() in inet_sock.h requires route.h. Since users of inet_iif() usually require other route.h functionality anyway this patch moves inet_iif() to route.h. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-27missing bits of net-namespace / sysctlAl Viro
Piss-poor sysctl registration API strikes again, film at 11... What we really need is _pathname_ required to be present in already registered table, so that kernel could warn about bad order. That's the next target for sysctl stuff (and generally saner and more explicit order of initialization of ipv[46] internals wouldn't hurt either). For the time being, here are full fixups required by ..._rotable() stuff; we make per-net sysctl sets descendents of "ro" one and make sure that sufficient skeleton is there before we start registering per-net sysctls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-06netns: add namespace parameter to rt_cache_flushDenis V. Lunev
Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()Timo Teras
Add struct net_device parameter to ip_rt_frag_needed() and update MTU to cache entries where ifindex is specified. This is similar to what is already done in ip_rt_redirect(). Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-25[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.YOSHIFUJI Hideaki
Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-24[NET]: include/net/route.h - remove duplicate includeJoe Perches
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-01[IPV4] route cache: Introduce rt_genid for smooth cache invalidationEric Dumazet
Current ip route cache implementation is not suited to large caches. We can consume a lot of CPU when cache must be invalidated, since we currently need to evict all cache entries, and this eviction is sometimes asynchronous. min_delay & max_delay can somewhat control this asynchronism behavior, but whole thing is a kludge, regularly triggering infamous soft lockup messages. When entries are still in use, this also consumes a lot of ram, filling dst_garbage.list. A better scheme is to use a generation identifier on each entry, so that cache invalidation can be performed by changing the table identifier, without having to scan all entries. No more delayed flushing, no more stalling when secret_interval expires. Invalidated entries will then be freed at GC time (controled by ip_rt_gc_timeout or stress), or when an invalidated entry is found in a chain when an insert is done. Thus we keep a normal equilibrium. This patch : - renames rt_hash_rnd to rt_genid (and makes it an atomic_t) - Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit) - Checks entry->rt_genid at appropriate places :
2008-02-01[NET]: Introducing socket mark socket option.Laszlo Attila Toth
A userspace program may wish to set the mark for each packets its send without using the netfilter MARK target. Changing the mark can be used for mark based routing without netfilter or for packet filtering. It requires CAP_NET_ADMIN capability. Signed-off-by: Laszlo Attila Toth <panther@balabit.hu> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Routing cache virtualization.Denis V. Lunev
Basically, this piece looks relatively easy. Namespace is already available on the dst entry via device and the device is safe to dereferrence. Compare it with one of a searcher and skip entry if appropriate. The only exception is ip_rt_frag_needed. So, add namespace parameter to it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Correct namespace for connect-time routing.Denis V. Lunev
ip_route_connect and ip_route_newports are a part of routing API presented to the socket layer. The namespace is available inside them through a socket. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Add namespace parameter to ip_route_output_key.Denis V. Lunev
Needed to propagate it down to the ip_route_output_flow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Add namespace parameter to ip_route_output_flow.Denis V. Lunev
Needed to propagate it down to the __ip_route_output_key. Signed_off_by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Add namespace parameter to __ip_route_output_key.Denis V. Lunev
This is only required to propagate it down to the ip_route_output_slow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Pass namespace through ip_rt_ioctl.Denis V. Lunev
... up to rtentry_to_fib_config Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETNS]: Add netns parameter to inet_(dev_)add_type.Eric W. Biederman
The patch extends the inet_addr_type and inet_dev_addr_type with the network namespace pointer. That allows to access the different tables relatively to the network namespace. The modification of the signature function is reported in all the callers of the inet_addr_type using the pointer to the well known init_net. Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[IPV4]: Add inet_dev_addr_type()Laszlo Attila Toth
Address type search can be limited to an interface by inet_dev_addr_type function. Signed-off-by: Laszlo Attila Toth <panther@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>