summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-01-09rhashtable: avoid unnecessary wakeup for worker queueYing Xue
Move condition statements of verifying whether hash table size exceeds its maximum threshold or reaches its minimum threshold from resizing functions to resizing decision functions, avoiding unnecessary wakeup for worker queue thread. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09rhashtable: future table needs to be traversed when remove an objectYing Xue
When remove an object from hash table, we currently only traverse old bucket table to check whether the object exists. If the object is not found in it, we will try again. But in the second search loop, we still search the object from the old table instead of future table. As a result, the object may be not removed from hash table especially when resizing is currently in progress and the object is just saved in the future table. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09rhashtable: involve rhashtable_lookup_insert routineYing Xue
Involve a new function called rhashtable_lookup_insert() which makes lookup and insertion atomic under bucket lock protection, helping us avoid to introduce an extra lock when we search and insert an object into hash table. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09rhashtable: introduce rhashtable_wakeup_worker helper functionYing Xue
Introduce rhashtable_wakeup_worker() helper function to reduce duplicated code where to wake up worker. By the way, as long as the both "future_tbl" and "tbl" bucket table pointers point to the same bucket array, we should try to wake up the resizing worker thread, otherwise, it indicates the work of resizing hash table is not finished yet. However, currently we will wake up the worker thread only when the two pointers point to different bucket array. Obviously this is wrong. So, the issue is also fixed as well in the patch. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09rhashtable: optimize rhashtable_lookup routineYing Xue
Define an internal compare function and relevant compare argument, and then make use of rhashtable_lookup_compare() to lookup key in hash table, reducing duplicated code between rhashtable_lookup() and rhashtable_lookup_compare(). Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09Merge branch 'cxgb4-next'David S. Miller
Hariprasad Shenai says: ==================== Add support for few debugfs entries This patch series adds support for devlog, cim_la, cim_qcfg and mps_tcam debugfs entries. The patches series is created against 'net-next' tree. And includes patches on cxgb4 driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09cxgb4: Add support for mps_tcam debugfsHariprasad Shenai
Debug log to get the MPS TCAM table Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09cxgb4: Add support for cim_qcfg entry in debugfsHariprasad Shenai
Adds debug log to get cim queue config Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09cxgb4: Add support for cim_la entry in debugfsHariprasad Shenai
The CIM LA captures the embedded processor’s internal state. Optionally, it can also trace the flow of data in and out of the embedded processor. Therefore, the CIM LA output contains detailed information of what code the embedded processor executed prior to the CIM LA capture. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09cxgb4: Add support for devlogHariprasad Shenai
Add support for device log entry in debugfs Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09doc: fix the compile error of txtimestamp.cWANG Cong
Vinson reported: HOSTCC Documentation/networking/timestamping/txtimestamp Documentation/networking/timestamping/txtimestamp.c:64:8: error: redefinition of ‘struct in6_pktinfo’ struct in6_pktinfo { ^ In file included from /usr/include/arpa/inet.h:23:0, from Documentation/networking/timestamping/txtimestamp.c:33: /usr/include/netinet/in.h:456:8: note: originally defined here struct in6_pktinfo ^ After we sync with libc header, we don't need this ugly hack any more. Reported-by: Vinson Lee <vlee@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-09ipv6: fix redefinition of in6_pktinfo and ip6_mtuinfoWANG Cong
Both netinet/in.h and linux/ipv6.h define these two structs, if we include both of them, we got: /usr/include/linux/ipv6.h:19:8: error: redefinition of ‘struct in6_pktinfo’ struct in6_pktinfo { ^ In file included from /usr/include/arpa/inet.h:22:0, from txtimestamp.c:33: /usr/include/netinet/in.h:524:8: note: originally defined here struct in6_pktinfo ^ In file included from txtimestamp.c:40:0: /usr/include/linux/ipv6.h:24:8: error: redefinition of ‘struct ip6_mtuinfo’ struct ip6_mtuinfo { ^ In file included from /usr/include/arpa/inet.h:22:0, from txtimestamp.c:33: /usr/include/netinet/in.h:531:8: note: originally defined here struct ip6_mtuinfo ^ So similarly to what we did for in6_addr, we need to sync with libc header on their definitions. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2015-01-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Just a pile of random fixes, including: 1) Do not apply TSO limits to non-TSO packets, fix from Herbert Xu. 2) MDI{,X} eeprom check in e100 driver is reversed, from John W. Linville. 3) Missing error return assignments in several ethernet drivers, from Julia Lawall. 4) Altera TSE device doesn't come back up after ifconfig down/up sequence, fix from Kostya Belezko. 5) Add more cases to the check for whether the qmi_wwan device has a bogus MAC address and needs to be assigned a random one. From Kristian Evensen. 6) Fix interrupt hangs in CPSW, from Felipe Balbi. 7) Implement ndo_features_check in r8152 so that the stack doesn't feed GSO packets which are outside of the chip's capabilities. From Hayes Wang" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) qla3xxx: don't allow never end busy loop xen-netback: fixing the propagation of the transmit shaper timeout r8152: support ndo_features_check batman-adv: fix potential TT client + orig-node memory leak batman-adv: fix multicast counter when purging originators batman-adv: fix counter for multicast supporting nodes batman-adv: fix lock class for decoding hash in network-coding.c batman-adv: fix delayed foreign originator recognition batman-adv: fix and simplify condition when bonding should be used Revert "mac80211: Fix accounting of the tailroom-needed counter" net: ethernet: cpsw: fix hangs with interrupts enic: free all rq buffs when allocation fails qmi_wwan: Set random MAC on devices with buggy fw openvswitch: Consistently include VLAN header in flow and port stats. tcp: Do not apply TSO segment limit to non-TSO packets Altera TSE: Add missing phydev net/mlx4_core: Fix error flow in mlx4_init_hca() net/mlx4_core: Correcly update the mtt's offset in the MR re-reg flow qlcnic: Fix return value in qlcnic_probe() net: axienet: fix error return code ...
2015-01-07Merge tag 'for-linus-3' of git://git.code.sf.net/p/openipmi/linux-ipmiLinus Torvalds
Pull IPMI fixlet from Corey Minyard: "Fix a compile warning" * tag 'for-linus-3' of git://git.code.sf.net/p/openipmi/linux-ipmi: ipmi: Fix compile warning with tv_usec
2015-01-06net: eth: xgene: change APM X-Gene SoC platform ethernet to support ACPIFeng Kan
This adds support for APM X-Gene ethernet driver to use ACPI table to derive ethernet driver parameter. Signed-off-by: Feng Kan <fkan@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06qla3xxx: don't allow never end busy loopAndy Shevchenko
The counter variable wasn't increased at all which may stuck under certain circumstances. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net/fsl: Add mEMAC MDIO support to XGMAC MDIOAndy Fleming
The Freescale mEMAC supports operating at 10/100/1000/10G, and its associated MDIO controller is likewise capable of operating both Clause 22 and Clause 45 MDIO buses. It is nearly identical to the MDIO controller on the XGMAC, so we just modify that driver. Portions of this driver developed by: Sandeep Singh <sandeep@freescale.com> Roy Zang <tie-fei.zang@freescale.com> Signed-off-by: Andy Fleming <afleming@gmail.com> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06ethtool: Extend ethtool plugin module eeprom API to phylibEd Swierk
This patch extends the ethtool plugin module eeprom API to support cards whose phy support is delegated to a separate driver. The handlers for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM call the module_info and module_eeprom functions if the phy driver provides them; otherwise the handlers call the equivalent ethtool_ops functions provided by network drivers with built-in phy support. Signed-off-by: Ed Swierk <eswierk@skyportsystems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Revert a potential seek_data/hole regression which shows up when using ext4 to handle ext3 file systems, plus two minor bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove spurious KERN_INFO from ext4_warning call Revert "ext4: fix suboptimal seek_{data,hole} extents traversial" ext4: prevent online resize with backup superblock
2015-01-06mm: propagate error from stack expansion even for guard pageLinus Torvalds
Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-06Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Included changes: - ensure bonding is used (if enabled) for packets coming in the soft interface - fix race condition to avoid orig_nodes to be deleted right after being added - avoid false positive lockdep splats by assigning lockclass to the proper hashtable lock objects - avoid miscounting of multicast 'disabled' nodes in the network - fix memory leak in the Global Translation Table in case of originator interval change Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06xen-netback: fixing the propagation of the transmit shaper timeoutPalik, Imre
Since e9ce7cb6b107 ("xen-netback: Factor queue-specific data into queue struct"), the transimt shaper timeout is always set to 0. The value the user sets via xenbus is never propagated to the transmit shaper. This patch fixes the issue. Cc: Anthony Liguori <aliguori@amazon.com> Signed-off-by: Imre Palik <imrep@amazon.de> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06Driver: Vmxnet3: Make Rx ring 2 size configurableShrikrishna Khare
Rx ring 2 size can be configured by adjusting rx-jumbo parameter of ethtool -G. Signed-off-by: Ramya Bolla <bollar@vmware.com> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com> Signed-off-by: Shrikrishna Khare <skhare@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06Merge tag 'mac80211-for-davem-2015-01-06' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Here's just a single fix - a revert of a patch that broke the p54 and cw2100 drivers (arguably due to bad assumptions there.) Since this affects kernels since 3.17, I decided to revert for now and we'll revisit this optimisation properly for -next. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06r8152: support ndo_features_checkhayeswang
Support ndo_features_check to avoid: - the transport offset is more than the hw limitation when using hw checksum. - the skb->len of a GSO packet is more than the limitation. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06arm_arch_timer: include clocksource.h directlyRichard Cochran
This driver makes use of the clocksource code. Previously it had only included the proper header indirectly, but that chain was inadvertently broken by 74d23cc "time: move the timecounter/cyclecounter code into its own file." This patch fixes the issue by including clocksource.h directly. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06cxgb4: Add PCI device ID for new T5 adapterHariprasad Shenai
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06batman-adv: fix potential TT client + orig-node memory leakLinus Lüssing
This patch fixes a potential memory leak which can occur once an originator times out. On timeout the according global translation table entry might not get purged correctly. Furthermore, the non purged TT entry will cause its orig-node to leak, too. Which additionally can lead to the new multicast optimization feature not kicking in because of a therefore bogus counter. In detail: The batadv_tt_global_entry->orig_list holds the reference to the orig-node. Usually this reference is released after BATADV_PURGE_TIMEOUT through: _batadv_purge_orig()-> batadv_purge_orig_node()->batadv_update_route()->_batadv_update_route()-> batadv_tt_global_del_orig() which purges this global tt entry and releases the reference to the orig-node. However, if between two batadv_purge_orig_node() calls the orig-node timeout grew to 2*BATADV_PURGE_TIMEOUT then this call path isn't reached. Instead the according orig-node is removed from the originator hash in _batadv_purge_orig(), the batadv_update_route() part is skipped and won't be reached anymore. Fixing the issue by moving batadv_tt_global_del_orig() out of the rcu callback. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Acked-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06batman-adv: fix multicast counter when purging originatorsLinus Lüssing
When purging an orig_node we should only decrease counter tracking the number of nodes without multicast optimizations support if it was increased through this orig_node before. A not yet quite initialized orig_node (meaning it did not have its turn in the mcast-tvlv handler so far) which gets purged would not adhere to this and will lead to a counter imbalance. Fixing this by adding a check whether the orig_node is mcast-initalized before decreasing the counter in the mcast-orig_node-purging routine. Introduced by 60432d756cf06e597ef9da511402dd059b112447 ("batman-adv: Announce new capability via multicast TVLV") Reported-by: Tobias Hachmer <tobias@hachmer.de> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06batman-adv: fix counter for multicast supporting nodesLinus Lüssing
A miscounting of nodes having multicast optimizations enabled can lead to multicast packet loss in the following scenario: If the first OGM a node receives from another one has no multicast optimizations support (no multicast tvlv) then we are missing to increase the counter. This potentially leads to the wrong assumption that we could safely use multicast optimizations. Fixings this by increasing the counter if the initial OGM has the multicast TVLV unset, too. Introduced by 60432d756cf06e597ef9da511402dd059b112447 ("batman-adv: Announce new capability via multicast TVLV") Reported-by: Tobias Hachmer <tobias@hachmer.de> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06batman-adv: fix lock class for decoding hash in network-coding.cMartin Hundebøll
batadv_has_set_lock_class() is called with the wrong hash table as first argument (probably due to a copy-paste error), which leads to false positives when running with lockdep. Introduced-by: 612d2b4fe0a1ff2f8389462a6f8be34e54124c05 ("batman-adv: network coding - save overheard and tx packets for decoding") Signed-off-by: Martin Hundebøll <martin@hundeboll.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06batman-adv: fix delayed foreign originator recognitionLinus Lüssing
Currently it can happen that the reception of an OGM from a new originator is not being accepted. More precisely it can happen that an originator struct gets allocated and initialized (batadv_orig_node_new()), even the TQ gets calculated and set correctly (batadv_iv_ogm_calc_tq()) but still the periodic orig_node purging thread will decide to delete it if it has a chance to jump between these two function calls. This is because batadv_orig_node_new() initializes the last_seen value to zero and its caller (batadv_iv_ogm_orig_get()) makes it visible to other threads by adding it to the hash table already. batadv_iv_ogm_calc_tq() will set the last_seen variable to the correct, current time a few lines later but if the purging thread jumps in between that it will think that the orig_node timed out and will wrongly schedule it for deletion already. If the purging interval is the same as the originator interval (which is the default: 1 second), then this game can continue for several rounds until the random OGM jitter added enough difference between these two (in tests, two to about four rounds seemed common). Fixing this by initializing the last_seen variable of an orig_node to the current time before adding it to the hash table. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06batman-adv: fix and simplify condition when bonding should be usedSimon Wunderlich
The current condition actually does NOT consider bonding when the interface the packet came in from is the soft interface, which is the opposite of what it should do (and the comment describes). Fix that and slightly simplify the condition. Reported-by: Ray Gibson <booray@gmail.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-01-06Merge branch 'rt_cong_ctrl'David S. Miller
Daniel Borkmann says: ==================== net: allow setting congctl via routing table This is the second part of our work and allows for setting the congestion control algorithm via routing table. For details, please see individual patches. Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then merging net into net-next, for example, and following up with the remaining feature patches wrt dependencies. Joint work with Florian Westphal, suggested by Hannes Frederic Sowa. Patch for iproute2 is available under [1], but will be reposted with along with the man-page update when this set hits net-next. [1] http://patchwork.ozlabs.org/patch/418149/ Thanks! v2 -> v3: - Added module auto-loading as suggested by David Miller, thanks! - Added patch 2 for handling possible sleeps in fib6 - While working on this, we discovered a bug, hence fix in patch 1 - Added auto-loading to patch 4 - Rebased, retested, rest the same. v1 -> v2: - Very sorry, I noticed I had decnet disabled during testing. Added missing header include in decnet, rest as is. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: tcp: add per route congestion controlDaniel Borkmann
This work adds the possibility to define a per route/destination congestion control algorithm. Generally, this opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics, even transparently for a single application listening on all links. For our specific use case, this additionally facilitates deployment of DCTCP, for example, applications can easily serve internal traffic/dsts in DCTCP and external one with CUBIC. Other scenarios would also allow for utilizing e.g. long living, low priority background flows for certain destinations/routes while still being able for normal traffic to utilize the default congestion control algorithm. We also thought about a per netns setting (where different defaults are possible), but given its actually a link specific property, we argue that a per route/destination setting is the most natural and flexible. The administrator can utilize this through ip-route(8) by appending "congctl [lock] <name>", where <name> denotes the name of a congestion control algorithm and the optional lock parameter allows to enforce the given algorithm so that applications in user space would not be allowed to overwrite that algorithm for that destination. The dst metric lookups are being done when a dst entry is already available in order to avoid a costly lookup and still before the algorithms are being initialized, thus overhead is very low when the feature is not being used. While the client side would need to drop the current reference on the module, on server side this can actually even be avoided as we just got a flat-copied socket clone. Joint work with Florian Westphal. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: tcp: add RTAX_CC_ALGO fib handlingDaniel Borkmann
This patch adds the minimum necessary for the RTAX_CC_ALGO congestion control metric to be set up and dumped back to user space. While the internal representation of RTAX_CC_ALGO is handled as a u32 key, we avoided to expose this implementation detail to user space, thus instead, we chose the netlink attribute that is being exchanged between user space to be the actual congestion control algorithm name, similarly as in the setsockopt(2) API in order to allow for maximum flexibility, even for 3rd party modules. It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as it should have been stored in RTAX_FEATURES instead, we first thought about reusing it for the congestion control key, but it brings more complications and/or confusion than worth it. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: tcp: add key management to congestion controlDaniel Borkmann
This patch adds necessary infrastructure to the congestion control framework for later per route congestion control support. For a per route congestion control possibility, our aim is to store a unique u32 key identifier into dst metrics, which can then be mapped into a tcp_congestion_ops struct. We argue that having a RTAX key entry is the most simple, generic and easy way to manage, and also keeps the memory footprint of dst entries lower on 64 bit than with storing a pointer directly, for example. Having a unique key id also allows for decoupling actual TCP congestion control module management from the FIB layer, i.e. we don't have to care about expensive module refcounting inside the FIB at this point. We first thought of using an IDR store for the realization, which takes over dynamic assignment of unused key space and also performs the key to pointer mapping in RCU. While doing so, we stumbled upon the issue that due to the nature of dynamic key distribution, it just so happens, arguably in very rare occasions, that excessive module loads and unloads can lead to a possible reuse of previously used key space. Thus, previously stale keys in the dst metric are now being reassigned to a different congestion control algorithm, which might lead to unexpected behaviour. One way to resolve this would have been to walk FIBs on the actually rare occasion of a module unload and reset the metric keys for each FIB in each netns, but that's just very costly. Therefore, we argue a better solution is to reuse the unique congestion control algorithm name member and map that into u32 key space through jhash. For that, we split the flags attribute (as it currently uses 2 bits only anyway) into two u32 attributes, flags and key, so that we can keep the cacheline boundary of 2 cachelines on x86_64 and cache the precalculated key at registration time for the fast path. On average we might expect 2 - 4 modules being loaded worst case perhaps 15, so a key collision possibility is extremely low, and guaranteed collision-free on LE/BE for all in-tree modules. Overall this results in much simpler code, and all without the overhead of an IDR. Due to the deterministic nature, modules can now be unloaded, the congestion control algorithm for a specific but unloaded key will fall back to the default one, and on module reload time it will switch back to the expected algorithm transparently. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: tcp: refactor reinitialization of congestion controlDaniel Borkmann
We can just move this to an extra function and make the code a bit more readable, no functional change. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: fib6: convert cfg metric to u32 outside of table write lockFlorian Westphal
Do the nla validation earlier, outside the write lock. This is needed by followup patch which needs to be able to call request_module (which can sleep) if needed. Joint work with Daniel Borkmann. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: fib6: fib6_commit_metrics: fix potential NULL pointer dereferenceDaniel Borkmann
When IPv6 host routes with metrics attached are being added, we fetch the metrics store from the dst via COW through dst_metrics_write_ptr(), added through commit e5fd387ad5b3. One remaining problem here is that we actually call into inet_getpeer() and may end up allocating/creating a new peer from the kmemcache, which may fail. Example trace from perf probe (inet_getpeer:41) where create is 1: ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293) 85e294 inet_getpeer.part.7 (<- kmem_cache_alloc()) 85e578 inet_getpeer 8eb333 ipv6_cow_metrics 8f10ff fib6_commit_metrics Therefore, a check for NULL on the return of dst_metrics_write_ptr() is necessary here. Joint work with Florian Westphal. Fixes: e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely") Cc: Michal Kubeček <mkubecek@suse.cz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is definedHubert Sokolowski
Add checking whether the call to ndo_dflt_fdb_dump is needed. It is not expected to call ndo_dflt_fdb_dump unconditionally by some drivers (i.e. qlcnic or macvlan) that defines own ndo_fdb_dump. Other drivers define own ndo_fdb_dump and don't want ndo_dflt_fdb_dump to be called at all. At the same time it is desirable to call the default dump function on a bridge device. Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump. Add extra checking in br_fdb_dump to avoid duplicate entries as now filter_dev can be NULL. Following tests for filtering have been performed before the change and after the patch was applied to make sure they are the same and it doesn't break the filtering algorithm. [root@localhost ~]# cd /root/iproute2-3.18.0/bridge [root@localhost bridge]# modprobe dummy [root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0 [root@localhost bridge]# brctl addbr br0 [root@localhost bridge]# brctl addif br0 dummy0 [root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04 [root@localhost bridge]# # show all [root@localhost bridge]# ./bridge fdb show 33:33:00:00:00:01 dev p2p1 self permanent 01:00:5e:00:00:01 dev p2p1 self permanent 33:33:ff:ac:ce:32 dev p2p1 self permanent 33:33:00:00:02:02 dev p2p1 self permanent 01:00:5e:00:00:fb dev p2p1 self permanent 33:33:00:00:00:01 dev p7p1 self permanent 01:00:5e:00:00:01 dev p7p1 self permanent 33:33:ff:79:50:53 dev p7p1 self permanent 33:33:00:00:02:02 dev p7p1 self permanent 01:00:5e:00:00:fb dev p7p1 self permanent f2:46:50:85:6d:d9 dev dummy0 master br0 permanent f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent 33:33:00:00:00:01 dev dummy0 self permanent f1:f2:f3:f4:f5:f6 dev dummy0 self permanent 33:33:00:00:00:01 dev br0 self permanent 02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent 02:00:00:12:01:04 dev br0 master br0 permanent [root@localhost bridge]# # filter by bridge [root@localhost bridge]# ./bridge fdb show br br0 f2:46:50:85:6d:d9 dev dummy0 master br0 permanent f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent 33:33:00:00:00:01 dev dummy0 self permanent f1:f2:f3:f4:f5:f6 dev dummy0 self permanent 33:33:00:00:00:01 dev br0 self permanent 02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent 02:00:00:12:01:04 dev br0 master br0 permanent [root@localhost bridge]# # filter by port [root@localhost bridge]# ./bridge fdb show brport dummy0 f2:46:50:85:6d:d9 master br0 permanent f2:46:50:85:6d:d9 vlan 1 master br0 permanent 33:33:00:00:00:01 self permanent f1:f2:f3:f4:f5:f6 self permanent [root@localhost bridge]# # filter by port + bridge [root@localhost bridge]# ./bridge fdb show br br0 brport dummy0 f2:46:50:85:6d:d9 master br0 permanent f2:46:50:85:6d:d9 vlan 1 master br0 permanent 33:33:00:00:00:01 self permanent f1:f2:f3:f4:f5:f6 self permanent [root@localhost bridge]# Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06Merge branch 'ip_cmsg_csum'David S. Miller
Tom Herbert says: ==================== ip: Support checksum returned in csmg This patch set allows the packet checksum for a datagram socket to be returned in csum data in recvmsg. This allows userspace to implement its own checksum over the data, for instance if an IP tunnel was be implemented in user space, the inner checksum could be validated. Changes in this patch set: - Move checksum conversion to inet_sock from udp_sock. This generalizes checksum conversion for use with other protocols. - Move IP cmsg constants to a header file and make processing of the flags more efficient in ip_cmsg_recv - Return checksum value in cmsg. This is specifically the unfolded 32 bit checksum of the full packet starting from the first byte returned in recvmsg Tested: Wrote a little server to get checksums in cmsg for UDP and verfied correct checksum is returned. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06ip: Add offset parameter to ip_cmsg_recvTom Herbert
Add ip_cmsg_recv_offset function which takes an offset argument that indicates the starting offset in skb where data is being received from. This will be useful in the case of UDP and provided checksum to user space. ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of zero. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06ip: Add offset parameter to ip_cmsg_recvTom Herbert
Add ip_cmsg_recv_offset function which takes an offset argument that indicates the starting offset in skb where data is being received from. This will be useful in the case of UDP and provided checksum to user space. ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of zero. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06ip: IP cmsg cleanupTom Herbert
Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that they can be referenced in other source files. Restructure ip_cmsg_recv to not go through flags using shift, check for flags by 'and'. This eliminates both the shift and a conditional per flag check. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06ip: Move checksum convert defines to inetTom Herbert
Move convert_csum from udp_sock to inet_sock. This allows the possibility that we can use convert checksum for different types of sockets and also allows convert checksum to be enabled from inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel()Thomas Graf
Calling nla_nest_cancel() in a different order as the nesting was built up can lead to negative offsets being calculated which results in skb_trim() being called with an underflowed unsigned int. Warn if mark < skb->data as it's definitely a bug. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06Linux 3.19-rc3Linus Torvalds
2015-01-05Merge tag 'powerpc-3.19-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc fixes from Michael Ellerman: - Wire up sys_execveat(). Tested on 32 & 64 bit. - Fix for kdump on LE systems with cpus hot unplugged. - Revert Anton's fix for "kernel BUG at kernel/smpboot.c:134!", this broke other platforms, we'll do a proper fix for 3.20. * tag 'powerpc-3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: Revert "powerpc: Secondary CPUs must set cpu_callin_map after setting active and online" powerpc/kdump: Ignore failure in enabling big endian exception during crash powerpc: Wire up sys_execveat() syscall