summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2015-03-04mpls: Multicast route table change notificationsEric W. Biederman
Unlike IPv4 this code notifies on all cases where mpls routes are added or removed and it never automatically removes routes. Avoiding both the userspace confusion that is caused by omitting route updates and the possibility of a flood of netlink traffic when an interface goes doew. For now reserved labels are handled automatically and userspace is not notified. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Netlink commands to add, remove, and dump routesEric W. Biederman
This change adds two new netlink routing attributes: RTA_VIA and RTA_NEWDST. RTA_VIA specifies the specifies the next machine to send a packet to like RTA_GATEWAY. RTA_VIA differs from RTA_GATEWAY in that it includes the address family of the address of the next machine to send a packet to. Currently the MPLS code supports addresses in AF_INET, AF_INET6 and AF_PACKET. For AF_INET and AF_INET6 the destination mac address is acquired from the neighbour table. For AF_PACKET the destination mac_address is specified in the netlink configuration. I think raw destination mac address support with the family AF_PACKET will prove useful. There is MPLS-TP which is defined to operate on machines that do not support internet packets of any flavor. Further seem to be corner cases where it can be useful. At this point I don't care much either way. RTA_NEWDST specifies the destination address to forward the packet with. MPLS typically changes it's destination address at every hop. For a swap operation RTA_NEWDST is specified with a length of one label. For a push operation RTA_NEWDST is specified with two or more labels. For a pop operation RTA_NEWDST is not specified or equivalently an emtpy RTAN_NEWDST is specified. Those new netlink attributes are used to implement handling of rt-netlink RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the MPLS label table. rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message, verify no unhandled attributes or unhandled values are present and sets up the data structures for mpls_route_add and mpls_route_del. I did my best to match up with the existing conventions with the caveats that MPLS addresses are all destination-specific-addresses, and so don't properly have a scope. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Add a sysctl to control the size of the mpls label tableEric W. Biederman
This sysctl gives two benefits. By defaulting the table size to 0 mpls even when compiled in and enabled defaults to not forwarding any packets. This prevents unpleasant surprises for users. The other benefit is that as mpls labels are allocated locally a dense table a small dense label table may be used which saves memory and is extremely simple and efficient to implement. This sysctl allows userspace to choose the restrictions on the label table size userspace applications need to cope with. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04mpls: Basic routing supportEric W. Biederman
This change adds a new Kconfig option MPLS_ROUTING. The core of this change is the code to look at an mpls packet received from another machine. Look that packet up in a routing table and forward the packet on. Support of MPLS over ATM is not considered or attempted here. This implemntation follows RFC3032 and implements the MPLS shim header that can pass over essentially any network. What RFC3021 refers to as the as the Incoming Label Map (ILM) I call net->mpls.platform_label[]. What RFC3031 refers to as the Next Label Hop Forwarding Entry (NHLFE) I call mpls_route. Though calling it the label fordwarding information base (lfib) might also be valid. Further the implemntation forwards packets as described in RFC3032. There is no need and given the original motivation for MPLS a strong discincentive to have a flexible label forwarding path. In essence the logic is the topmost label is read, looked up, removed, and replaced by 0 or more new lables and the sent out the specified interface to it's next hop. Quite a few optional features are not implemented here. Among them are generation of ICMP errors when the TTL is exceeded or the packet is larger than the next hop MTU (those conditions are detected and the packets are dropped instead of generating an icmp error). The traffic class field is always set to 0. The implementation focuses on IP over MPLS and does not handle egress of other kinds of protocols. Instead of implementing coordination with the neighbour table and sorting out how to input next hops in a different address family (for which there is value). I was lazy and implemented a next hop mac address instead. The code is simpler and there are flavor of MPLS such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is appropriate so a next hop by mac address would need to be implemented at some point. Two new definitions AF_MPLS and PF_MPLS are exposed to userspace. Decoding the mpls header must be done by first byeswapping a 32bit bit endian word into the local cpu endian and then bit shifting to extract the pieces. There is no C bit-field that can represent a wire format mpls header on a little endian machine as the low bits of the 20bit label wind up in the wrong half of third byte. Therefore internally everything is deal with in cpu native byte order except when writing to and reading from a packet. For management simplicity if a label is configured to forward out an interface that is down the packet is dropped early. Similarly if an network interface is removed rt_dev is updated to NULL (so no reference is preserved) and any packets for that label are dropped. Keeping the label entries in the kernel allows the kernel label table to function as the definitive source of which labels are allocated and which are not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04neigh: Add helper function neigh_xmitEric W. Biederman
For MPLS I am building the code so that either the neighbour mac address can be specified or we can have a next hop in ipv4 or ipv6. The kind of next hop we have is indicated by the neighbour table pointer. A neighbour table pointer of NULL is a link layer address. A non-NULL neighbour table pointer indicates which neighbour table and thus which address family the next hop address is in that we need to look up. The code either sends a packet directly or looks up the appropriate neighbour table entry and sends the packet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04neigh: Factor out ___neigh_lookup_norefEric W. Biederman
While looking at the mpls code I found myself writing yet another version of neigh_lookup_noref. We currently have __ipv4_lookup_noref and __ipv6_lookup_noref. So to make my work a little easier and to make it a smidge easier to verify/maintain the mpls code in the future I stopped and wrote ___neigh_lookup_noref. Then I rewote __ipv4_lookup_noref and __ipv6_lookup_noref in terms of this new function. I tested my new version by verifying that the same code is generated in ip_finish_output2 and ip6_finish_output2 where these functions are inlined. To get to ___neigh_lookup_noref I added a new neighbour cache table function key_eq. So that the static size of the key would be available. I also added __neigh_lookup_noref for people who want to to lookup a neighbour table entry quickly but don't know which neibhgour table they are going to look up. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) If an IPVS tunnel is created with a mixed-family destination address, it cannot be removed. Fix from Alexey Andriyanov. 2) Fix module refcount underflow in netfilter's nft_compat, from Pablo Neira Ayuso. 3) Generic statistics infrastructure can reference variables sitting on a released function stack, therefore use dynamic allocation always. Fix from Ignacy Gawędzki. 4) skb_copy_bits() return value test is inverted in ip_check_defrag(). 5) Fix network namespace exit in openvswitch, we have to release all of the per-net vports. From Pravin B Shelar. 6) Fix signedness bug in CAIF's cfpkt_iterate(), from Dan Carpenter. 7) Fix rhashtable grow/shrink behavior, only expand during inserts and shrink during deletes. From Daniel Borkmann. 8) Netdevice names with semicolons should never be allowed, because they serve as a separator. From Matthew Thode. 9) Use {,__}set_current_state() where appropriate, from Fabian Frederick. 10) Revert byte queue limits support in r8169 driver, it's causing regressions we can't figure out. 11) tcp_should_expand_sndbuf() erroneously uses tp->packets_out to measure packets in flight, properly use tcp_packets_in_flight() instead. From Neal Cardwell. 12) Fix accidental removal of support for bluetooth in CSR based Intel wireless cards. From Marcel Holtmann. 13) We accidently added a behavioral change between native and compat tasks, wrt testing the MSG_CMSG_COMPAT bit. Just ignore it if the user happened to set it in a native binary as that was always the behavior we had. From Catalin Marinas. 14) Check genlmsg_unicast() return valud in hwsim netlink tx frame handling, from Bob Copeland. 15) Fix stale ->radar_required setting in mac80211 that can prevent starting new scans, from Eliad Peller. 16) Fix memory leak in nl80211 monitor, from Johannes Berg. 17) Fix race in TX index handling in xen-netback, from David Vrabel. 18) Don't enable interrupts in amx-xgbe driver until all software et al. state is ready for the interrupt handler to run. From Thomas Lendacky. 19) Add missing netlink_ns_capable() checks to rtnl_newlink(), from Eric W Biederman. 20) The amount of header space needed in macvtap was not calculated properly, fix it otherwise we splat past the beginning of the packet. From Eric Dumazet. 21) Fix bcmgenet TCP TX perf regression, from Jaedon Shin. 22) Don't raw initialize or mod timers, use setup_timer() and mod_timer() instead. From Vaishali Thakkar. 23) Fix software maintained statistics in bcmgenet and systemport drivers, from Florian Fainelli. 24) DMA descriptor updates in sh_eth need proper memory barriers, from Ben Hutchings. 25) Don't do UDP Fragmentation Offload on RAW sockets, from Michal Kubecek. 26) Openvswitch's non-masked set actions aren't constructed properly into netlink messages, fix from Joe Stringer. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits) openvswitch: Fix serialization of non-masked set actions. gianfar: Reduce logging noise seen due to phy polling if link is down ibmveth: Add function to enable live MAC address changes net: bridge: add compile-time assert for cb struct size udp: only allow UFO for packets from SOCK_DGRAM sockets sh_eth: Really fix padding of short frames on TX Revert "sh_eth: Enable Rx descriptor word 0 shift for r8a7790" sh_eth: Fix RX recovery on R-Car in case of RX ring underrun sh_eth: Ensure proper ordering of descriptor active bit write/read net/mlx4_en: Disbale GRO for incoming loopback/selftest packets net/mlx4_core: Fix wrong mask and error flow for the update-qp command net: systemport: fix software maintained statistics net: bcmgenet: fix software maintained statistics rxrpc: don't multiply with HZ twice rxrpc: terminate retrans loop when sending of skb fails net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface. net: pasemi: Use setup_timer and mod_timer net: stmmac: Use setup_timer and mod_timer net: 8390: axnet_cs: Use setup_timer and mod_timer net: 8390: pcnet_cs: Use setup_timer and mod_timer ...
2015-03-03ax25: Stop using magic neighbour cache operations.Eric W. Biederman
Before the ax25 stack calls dev_queue_xmit it always calls ax25_type_trans which sets skb->protocol to ETH_P_AX25. Which means that by looking at the protocol type it is possible to detect IP packets that have not been munged by the ax25 stack in ndo_start_xmit and call a function to munge them. Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and value to be appropriate for an ndo_start_xmit function. Update all of the ax25 devices to test the protocol type for ETH_P_IP and return ax25_ip_xmit as the first thing they do. This preserves the existing semantics of IP packet processing, but the timing will be a little different as the IP packets now pass through the qdisc layer before reaching the ax25 ip packet processing. Remove the now unnecessary ax25 neighbour table operations. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02Merge branch 'fixes-for-4.0-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal Pull thermal management fixes from Eduardo Valentin: "Specifics: - Several fixes in tmon tool. - Fixes in intel int340x for _ART and _TRT tables. - Add id for Avoton SoC into powerclamp driver. - Fixes in RCAR thermal driver to remove race conditions and fix fail path - Fixes in TI thermal driver: removal of unnecessary code and build fix if !CONFIG_PM_SLEEP - Cleanups in exynos thermal driver - Add stubs for include/linux/thermal.h. Now drivers using thermal calls but that also work without CONFIG_THERMAL will be able to compile for systems that don't care about thermal. Note: I am sending this pull on Rui's behalf while he fixes issues in his Linux box" * 'fixes-for-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal: thermal: int340x_thermal: Ignore missing _ART, _TRT tables thermal/intel_powerclamp: add id for Avoton SoC tools/thermal: tmon: silence 'set but not used' warnings tools/thermal: tmon: use pkg-config to determine library dependencies tools/thermal: tmon: support cross-compiling tools/thermal: tmon: add .gitignore tools/thermal: tmon: fixup tui windowing calculations tools/thermal: tmon: tui: don't hard-code dialog window size assumptions tools/thermal: tmon: add min/max macros tools/thermal: tmon: add --target-temp parameter thermal: exynos: Clean-up code to use oneline entry for exynos compatible table thermal: rcar: Make error and remove paths symmetrical with init thermal: rcar: Fix race condition between init and interrupt thermal: Introduce dummy functions when thermal is not defined ti-soc-thermal: Delete an unnecessary check before the function call "cpufreq_cooling_unregister" thermal: ti-soc-thermal: bandgap: Fix build warning if !CONFIG_PM_SLEEP
2015-03-02neigh: Don't require dst in neigh_hh_initEric W. Biederman
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed - Acquire the device from neigh->dev This results in a neigh_hh_init that will cache the samve values regardless of the packets flowing through it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02arp: Kill arp_findEric W. Biederman
There are no more callers so kill this function. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: Kill dev_rebuild_headerEric W. Biederman
Now that there are no more users kill dev_rebuild_header and all of it's implementations. This is long overdue. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02neigh: Move neigh_compat_output into ax25_ip.cEric W. Biederman
The only caller is now is ax25_neigh_construct so move neigh_compat_output into ax25_ip.c make it static and rename it ax25_neigh_output. Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-hams@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02ax25: Refactor to use private neighbour operations.Eric W. Biederman
AX25 already has it's own private arp cache operations to isolate it's abuse of dev_rebuild_header to transmit packets. Add a function ax25_neigh_construct that will allow all of the ax25 devices to force using these operations, so that the generic arp code does not need to. Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-hams@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02ax25: Make ax25_header and ax25_rebuild_header staticEric W. Biederman
The only user is in ax25_ip.c so stop exporting these functions. Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-hams@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net/mlx4_core: Fix wrong mask and error flow for the update-qp commandOr Gerlitz
The bit mask for currently supported driver features (MLX4_UPDATE_QP_SUPPORTED_ATTRS) of the update-qp command was defined twice (using enum value and pre-processor define directive) and wrong. The return value of the call to mlx4_update_qp() from within the SRIOV resource-tracker was wrongly voided down. Fix both issues. issue: none Fixes: 09e05c3f78e9 ('net/mlx4: Set vlan stripping policy by the right command') Fixes: ce8d9e0d6746 ('net/mlx4_core: Add UPDATE_QP SRIOV wrapper support') Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02ebpf: move CONFIG_BPF_SYSCALL-only function declarationsDaniel Borkmann
Masami noted that it would be better to hide the remaining CONFIG_BPF_SYSCALL-only function declarations within the BPF header ifdef, w/o else path dummy alternatives since these functions are not supposed to have a user outside of CONFIG_BPF_SYSCALL. Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Reference: http://article.gmane.org/gmane.linux.kernel.api/8658 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next A small batch with accumulated updates in nf-next, mostly IPVS updates, they are: 1) Add 64-bits stats counters to IPVS, from Julian Anastasov. 2) Move NETFILTER_XT_MATCH_ADDRTYPE out of NETFILTER_ADVANCED as docker seem to require this, from Anton Blanchard. 3) Use boolean instead of numeric value in set_match_v*(), from coccinelle via Fengguang Wu. 4) Allows rescheduling of new connections in IPVS when port reuse is detected, from Marcelo Ricardo Leitner. 5) Add missing bits to support arptables extensions from nft_compat, from Arturo Borrero. Patrick is preparing a large batch to enhance the set infrastructure, named expressions among other things, that should follow up soon after this batch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-03-02 Here's the first bluetooth-next pull request targeting the 4.1 kernel: - ieee802154/6lowpan cleanups - SCO routing to host interface support for the btmrvl driver - AMP code cleanups - Fixes to AMP HCI init sequence - Refactoring of the HCI callback mechanism - Added shutdown routine for Intel controllers in the btusb driver - New config option to enable/disable Bluetooth debugfs information - Fix for early data reception on L2CAP fixed channels Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: Remove iocb argument from sendmsg and recvmsgYing Xue
After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: move skb->dropcount to skb->cb[]Eyal Birger
Commit 977750076d98 ("af_packet: add interframe drop cmsg (v6)") unionized skb->mark and skb->dropcount in order to allow recording of the socket drop count while maintaining struct sk_buff size. skb->dropcount was introduced since there was no available room in skb->cb[] in packet sockets. However, its introduction led to the inability to export skb->mark, or any other aliased field to userspace if so desired. Moving the dropcount metric to skb->cb[] eliminates this problem at the expense of 4 bytes less in skb->cb[] for protocol families using it. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: add common accessor for setting dropcount on packetsEyal Birger
As part of an effort to move skb->dropcount to skb->cb[], use a common function in order to set dropcount in struct sk_buff. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: use common macro for assering skb->cb[] available size in protocol familiesEyal Birger
As part of an effort to move skb->dropcount to skb->cb[] use a common macro in protocol families using skb->cb[] for ancillary data to validate available room in skb->cb[]. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: bluetooth: compact struct bt_skb_cb by converting boolean fields to bit ↵Eyal Birger
fields Convert boolean fields incoming and req_start to bit fields and move force_active in order save space in bt_skb_cb in an effort to use a portion of skb->cb[] for storing skb->dropcount. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: bluetooth: compact struct bt_skb_cb by inlining struct hci_req_ctrlEyal Birger
struct hci_req_ctrl is never used outside of struct bt_skb_cb; Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing the addition of more ancillary data. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02pppoe: Use workqueue to die properly when a PADT is receivedSimon Farnsworth
When a PADT frame is received, the socket may not be in a good state to close down the PPP interface. The current implementation handles this by simply blocking all further PPP traffic, and hoping that the lack of traffic will trigger the user to investigate. Use schedule_work to get to a process context from which we clear down the PPP interface, in a fashion analogous to hangup on a TTY-based PPP interface. This causes pppd to disconnect immediately, and allows tools to take immediate corrective action. Note that pppd's rp_pppoe.so plugin has code in it to disable the session when it disconnects; however, as a consequence of this patch, the session is already disabled before rp_pppoe.so is asked to disable the session. The result is a harmless error message: Failed to disconnect PPPoE socket: 114 Operation already in progress This message is safe to ignore, as long as the error is 114 Operation already in progress; in that specific case, it means that the PPPoE session has already been disabled before pppd tried to disable it. Signed-off-by: Simon Farnsworth <simon@farnz.org.uk> Tested-by: Dan Williams <dcbw@redhat.com> Tested-by: Christoph Schulz <develop@kristov.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01cls_bpf: add initial eBPF support for programmable classifiersDaniel Borkmann
This work extends the "classic" BPF programmable tc classifier by extending its scope also to native eBPF code! This allows for user space to implement own custom, 'safe' C like classifiers (or whatever other frontend language LLVM et al may provide in future), that can then be compiled with the LLVM eBPF backend to an eBPF elf file. The result of this can be loaded into the kernel via iproute2's tc. In the kernel, they can be JITed on major archs and thus run in native performance. Simple, minimal toy example to demonstrate the workflow: #include <linux/ip.h> #include <linux/if_ether.h> #include <linux/bpf.h> #include "tc_bpf_api.h" __section("classify") int cls_main(struct sk_buff *skb) { return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos)); } char __license[] __section("license") = "GPL"; The classifier can then be compiled into eBPF opcodes and loaded via tc, for example: clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o tc filter add dev em1 parent 1: bpf cls.o [...] As it has been demonstrated, the scope can even reach up to a fully fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c). For tc, maps are allowed to be used, but from kernel context only, in other words, eBPF code can keep state across filter invocations. In future, we perhaps may reattach from a different application to those maps e.g., to read out collected statistics/state. Similarly as in socket filters, we may extend functionality for eBPF classifiers over time depending on the use cases. For that purpose, cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so we can allow additional functions/accessors (e.g. an ABI compatible offset translation to skb fields/metadata). For an initial cls_bpf support, we allow the same set of helper functions as eBPF socket filters, but we could diverge at some point in time w/o problem. I was wondering whether cls_bpf and act_bpf could share C programs, I can imagine that at some point, we introduce i) further common handlers for both (or even beyond their scope), and/or if truly needed ii) some restricted function space for each of them. Both can be abstracted easily through struct bpf_verifier_ops in future. The context of cls_bpf versus act_bpf is slightly different though: a cls_bpf program will return a specific classid whereas act_bpf a drop/non-drop return code, latter may also in future mangle skbs. That said, we can surely have a "classify" and "action" section in a single object file, or considered mentioned constraint add a possibility of a shared section. The workflow for getting native eBPF running from tc [1] is as follows: for f_bpf, I've added a slightly modified ELF parser code from Alexei's kernel sample, which reads out the LLVM compiled object, sets up maps (and dynamically fixes up map fds) if any, and loads the eBPF instructions all centrally through the bpf syscall. The resulting fd from the loaded program itself is being passed down to cls_bpf, which looks up struct bpf_prog from the fd store, and holds reference, so that it stays available also after tc program lifetime. On tc filter destruction, it will then drop its reference. Moreover, I've also added the optional possibility to annotate an eBPF filter with a name (e.g. path to object file, or something else if preferred) so that when tc dumps currently installed filters, some more context can be given to an admin for a given instance (as opposed to just the file descriptor number). Last but not least, bpf_prog_get() and bpf_prog_put() needed to be exported, so that eBPF can be used from cls_bpf built as a module. Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images read-only") I think this is of no concern since anything wanting to alter eBPF opcode after verification stage would crash the kernel. [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01ebpf: move read-only fields to bpf_prog and shrink bpf_prog_auxDaniel Borkmann
is_gpl_compatible and prog_type should be moved directly into bpf_prog as they stay immutable during bpf_prog's lifetime, are core attributes and they can be locked as read-only later on via bpf_prog_select_runtime(). With a bit of rearranging, this also allows us to shrink bpf_prog_aux to exactly 1 cacheline. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01ebpf: add sched_cls_type and map it to sk_filter's verifier opsDaniel Borkmann
As discussed recently and at netconf/netdev01, we want to prevent making bpf_verifier_ops registration available for modules, but have them at a controlled place inside the kernel instead. The reason for this is, that out-of-tree modules can go crazy and define and register any verfifier ops they want, doing all sorts of crap, even bypassing available GPLed eBPF helper functions. We don't want to offer such a shiny playground, of course, but keep strict control to ourselves inside the core kernel. This also encourages us to design eBPF user helpers carefully and generically, so they can be shared among various subsystems using eBPF. For the eBPF traffic classifier (cls_bpf), it's a good start to share the same helper facilities as we currently do in eBPF for socket filters. That way, we have BPF_PROG_TYPE_SCHED_CLS look like it's own type, thus one day if there's a good reason to diverge the set of helper functions from the set available to socket filters, we keep ABI compatibility. In future, we could place all bpf_prog_type_list at a central place, perhaps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefsDaniel Borkmann
Socket filter code and other subsystems with upcoming eBPF support should not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or not. Having the bpf syscall as a config option is a nice thing and I'd expect it to stay that way for expert users (I presume one day the default setting of it might change, though), but code making use of it should not care if it's actually enabled or not. Instead, hide this via header files and let the rest deal with it. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01ebpf: export BPF_PSEUDO_MAP_FD to uapiDaniel Borkmann
We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the ELF BPF loader where instructions are being loaded that need map fixups. An initial stage loads all maps into the kernel, and later on replaces related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source register and the actual fd as immediate value. The kernel verifier recognizes this keyword and replaces the map fd with a real pointer internally. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01ebpf: constify various function pointer structsDaniel Borkmann
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro section, bpf_map_type_list and bpf_prog_type_list into read mostly. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28tcp: tso: remove tp->tso_deferredEric Dumazet
TSO relies on ability to defer sending a small amount of packets. Heuristic is to wait for future ACKS in hope to send more packets at once. Current algorithm uses a per socket tso_deferred field as a pseudo timer. This pseudo timer relies on future ACK, but there is no guarantee we receive them in time. Fix would be to use a real timer, but cost of such timer is probably too expensive for typical cases. This patch changes the logic to test the time of last transmit, because we should not add bursts of more than 1ms for any given flow. We've used this patch for about two years at Google, before FQ/pacing as it would reduce a fair amount of bursts. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28usbnet: Fix tx_packets stat for FLAG_MULTI_FRAME driversBen Hutchings
Currently the usbnet core does not update the tx_packets statistic for drivers with FLAG_MULTI_PACKET and there is no hook in the TX completion path where they could do this. cdc_ncm and dependent drivers are bumping tx_packets stat on the transmit path while asix and sr9800 aren't updating it at all. Add a packet count in struct skb_data so these drivers can fill it in, initialise it to 1 for other drivers, and add the packet count to the tx_packets statistic on completion. Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Tested-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "Just general fixes: radeon, i915, atmel, tegra, amdkfd and one core fix" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (28 commits) drm: atmel-hlcdc: remove clock polarity from crtc driver drm/radeon: only enable DP audio if the monitor supports it drm/radeon: fix atom aux payload size check for writes (v2) drm/radeon: fix 1 RB harvest config setup for TN/RL drm/radeon: enable SRBM timeout interrupt on EG/NI drm/radeon: enable SRBM timeout interrupt on SI drm/radeon: enable SRBM timeout interrupt on CIK v2 drm/radeon: dump full IB if we hit a packet error drm/radeon: disable mclk switching with 120hz+ monitors drm/radeon: use drm_mode_vrefresh() rather than mode->vrefresh drm/radeon: enable native backlight control on old macs drm/i915: Fix frontbuffer false positve. drm/i915: Align initial plane backing objects correctly drm/i915: avoid processing spurious/shared interrupts in low-power states drm/i915: Check obj->vma_list under the struct_mutex drm/i915: Fix a use after free, and unbalanced refcounting drm: atmel-hlcdc: remove useless pm_runtime_put_sync in probe drm: atmel-hlcdc: reset layer A2Q and UPDATE bits when disabling it drm: Fix deadlock due to getconnector locking changes drm/i915: Dell Chromebook 11 has PWM backlight ...
2015-02-27fib_trie: Convert fib_alias to hlist from listAlexander Duyck
There isn't any advantage to having it as a list and by making it an hlist we make the fib_alias more compatible with the list_info in terms of the type of list used. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27multicast: Extend ip address command to enable multicast group join/leave onMadhu Challa
Joining multicast group on ethernet level via "ip maddr" command would not work if we have an Ethernet switch that does igmp snooping since the switch would not replicate multicast packets on ports that did not have IGMP reports for the multicast addresses. Linux vxlan interfaces created via "ip link add vxlan" have the group option that enables then to do the required join. By extending ip address command with option "autojoin" we can get similar functionality for openvswitch vxlan interfaces as well as other tunneling mechanisms that need to receive multicast traffic. The kernel code is structured similar to how the vxlan driver does a group join / leave. example: ip address add 224.1.1.10/24 dev eth5 autojoin ip address del 224.1.1.10/24 dev eth5 Signed-off-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27igmp v6: add __ipv6_sock_mc_join and __ipv6_sock_mc_dropMadhu Challa
Based on the igmp v4 changes from Eric Dumazet. 959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()") These changes are needed to perform igmp v6 join/leave while RTNL is held. Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around __ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid proliferation of work queues. Signed-off-by: Madhu Challa <challa@noironetworks.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27rhashtable: remove indirection for grow/shrink decision functionsDaniel Borkmann
Currently, all real users of rhashtable default their grow and shrink decision functions to rht_grow_above_75() and rht_shrink_below_30(), so that there's currently no need to have this explicitly selectable. It can/should be generic and private inside rhashtable until a real use case pops up. Since we can make this private, we'll save us this additional indirection layer and can improve insertion/deletion time as well. Reference: http://patchwork.ozlabs.org/patch/443040/ Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27udp: In udp_flow_src_port use random hash value if skb_get_hash failsTom Herbert
In the unlikely event that skb_get_hash is unable to deduce a hash in udp_flow_src_port we use a consistent random value instead. This is specified in GRE/UDP draft section 3.2.1: https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27at86rf230: add support for external xtal trimAlexander Aring
This patch adds support for setting the xtal trim register. Some at86rf2xx transceiver boards needs fine tuning the xtal capacitor. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-27Bluetooth: Update New CSRK event to match latest specificationJohan Hedberg
The 'master' parameter of the New CSRK event was recently renamed to 'type', with the old values kept for backwards compatibility as unauthenticated local/remote keys. This patch updates the code to take into account the two new (authenticated) values and ensures they get used based on the security level of the connection that the respective keys get distributed over. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-25net: dsa: Introduce dsa_is_port_initializedGuenter Roeck
To avoid race conditions when using the ds->ports[] array, we need to check if the accessed port has been initialized. Introduce and use helper function dsa_is_port_initialized for that purpose and use it where needed. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-25net: dsa: integrate with SWITCHDEV for HW bridgingFlorian Fainelli
In order to support bridging offloads in DSA switch drivers, select NET_SWITCHDEV to get access to the port_stp_update and parent_get_id NDOs that we are required to implement. To facilitate the integratation at the DSA driver level, we implement 3 types of operations: - port_join_bridge - port_leave_bridge - port_stp_update DSA will resolve which switch ports that are currently bridge port members as some Switch hardware/drivers need to know about that to limit the register programming to just the relevant registers (especially for slow MDIO buses). We also take care of setting the correct STP state when slave network devices are brought up/down while being bridge members. Finally, when a port is leaving the bridge, we make sure we set in BR_STATE_FORWARDING state, otherwise the bridge layer would leave it disabled as a result of having left the bridge. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-25ipvs: allow rescheduling of new connections when port reuse is detectedMarcelo Ricardo Leitner
Currently, when TCP/SCTP port reusing happens, IPVS will find the old entry and use it for the new one, behaving like a forced persistence. But if you consider a cluster with a heavy load of small connections, such reuse will happen often and may lead to a not optimal load balancing and might prevent a new node from getting a fair load. This patch introduces a new sysctl, conn_reuse_mode, that allows controlling how to proceed when port reuse is detected. The default value will allow rescheduling of new connections only if the old entry was in TIME_WAIT state for TCP or CLOSED for SCTP. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-24bonding: Implement port churn-machine (AD standard 43.4.17).Mahesh Bandewar
The Churn Detection machines detect the situation where a port is operable, but the Actor and Partner have not attached the link to an Aggregator and brought the link into operation within a bound time period. Under normal operation of the LACP, agreement between Actor and Partner should be reached very rapidly. Continued failure to reach agreement can be symptomatic of device failure. Actor-churn-detection state-machine Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> =================================== BEGIN=True + PortEnable=False | v +------------------------+ ActorPort.Sync=True +------------------+ | ACTOR_CHURN_MONITOR | ---------------------> | NO_ACTOR_CHURN | |========================| |==================| | ActorChurn=False | ActorPort.Sync=False | ActorChurn=False | | ActorChurn.Timer=Start | <--------------------- | | +------------------------+ +------------------+ | ^ | | ActorChurn.Timer=Expired | | ActorPort.Sync=True | | | +-----------------+ | | | ACTOR_CHURN | | | |=================| | +--------------> | ActorChurn=True | ------------+ | | +-----------------+ Similar for the Partner-churn-detection. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24thermal: Introduce dummy functions when thermal is not definedNishanth Menon
When CONFIG_THERMAL is not enabled, it is better to introduce equivalent dummy functions in the exported header than to introduce #ifdeffery in drivers using the function. This will prevent issues such as that reported in: http://www.spinics.net/lists/linux-next/msg31573.html While at it switch over to IS_ENABLED for thermal macros to allow for thermal framework to be built as framework and relevant APIs be usable by relevant drivers as a result. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Nishanth Menon <nm@ti.com> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2015-02-24Merge tag 'stable/for-linus-4.0-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen bugfixes from David Vrabel: "Xen regression and bug fixes for 4.0-rc1 - Fix two regressions introduced in 4.0-rc1 affecting PV/PVH guests in certain configurations. - Prevent pvscsi frontends bypassing backend checks. - Allow privcmd hypercalls to be preempted even on kernel with voluntary preemption. This fixes soft-lockups with long running toolstack hypercalls (e.g., when creating/destroying large domains)" * tag 'stable/for-linus-4.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: Initialize cr4 shadow for 64-bit PV(H) guests xen-scsiback: mark pvscsi frontend request consumed only after last read x86/xen: allow privcmd hypercalls to be preempted x86/xen: Make sure X2APIC_ENABLE bit of MSR_IA32_APICBASE is not set
2015-02-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID fixes from Jiri Kosina: - a few fixes to Sony driver (rmmod breakage, spinlock initialization), by Antonio Ospite, Frank Praznik and Jiri Kosina - fix for wMaxInputLength handling regression in i2c-hid, by Seth Forshee - IRQ safety spinlock fix in sensor hub driver, by Srinivas Pandruvada - IRQ level sensitivity fix to i2c-hid to be compliant with the spec, by Mika Westerberg - a couple device ID additions piggy-backing on top of that * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: HID: microsoft: Add ID for NE7K wireless keyboard HID: i2c-hid: Limit reads to wMaxInputLength bytes for input events HID: sony: fix uninitialized per-controller spinlock HID: sony: initialize sony_dev_list_lock properly HID: sony: Fix a WARNING shown when rmmod-ing the driver HID: sensor-hub: correct dyn_callback_lock IRQ-safe change HID: hid-sensor-hub: Correct documentation HID: saitek: add USB ID for older R.A.T. 7 HID: i2c-hid: The interrupt should be level sensitive HID: wacom: Add missing ABS_MISC event and feature declaration for 27QHD