summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-09-28cxgb4: Add support for adaptive rxHariprasad Shenai
Based on original work by Kumar Sanghvi <kumaras@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28cxgb4/cxgb4vf: Add Devicde ID for two more adapterHariprasad Shenai
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28cxgb4vf: Remove superfluous "idx" parameter of CH_DEVICE() macro.Hariprasad Shenai
Remove redundant idx parameter of CH_DEVICE() macro, its always zero. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28cxgb4: Use BAR2 Going To Sleep (GTS) for T5 and later.Hariprasad Shenai
Use BAR2 GTS for T5. If we are on T4 use the old doorbell mechanism; otherwise ue the new BAR2 mechanism. Use BAR2 doorbells for refilling FL's. Based on original work by Casey Leedom <leedom@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28arp: Do not perturb drop profiles with ignored ARP packetsRick Jones
We do not wish to disturb dropwatch or perf drop profiles with an ARP we will ignore. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_sched: remove the first parameter from tcf_exts_destroy()WANG Cong
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <hadi@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28mlx4: exploit skb->xmit_more to conditionally send doorbellEric Dumazet
skb->xmit_more tells us if another skb is coming next. We need to send doorbell when : xmit_more is not set, or txqueue is stopped (preventing next skb to come immediately) Tested with a modified pktgen version, I got a 40% increase of throughput. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'r8152'David S. Miller
Hayes Wang says: ==================== r8152: support setting eee by ethtool Modify some definitions about EEE, and add the support of setting the EEE through ethtool. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28r8152: support ethtool eeehayeswang
Support get_eee() and set_eee() of ethtool_ops. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28r8152: add functions to set EEEhayeswang
Add functions to enable EEE and set EEE advertisement. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28r8152: change the EEE definitionhayeswang
Replace the EEE definitions with the ones which is declared in "mdio.h". Chage some definitions to make them readable. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'defxx-next'David S. Miller
Maciej W. Rozycki says: ==================== defxx: DEFEA fixes and updates I have finally got my hands on an EISA variation of the board (DEC FDDIcontroller/EISA aka DEFEA) and was able to do some testing. Here are initial updates to the driver that address problems I encountered so far. More to come later on as I get back to the system that I have in a remote location -- I need to double-check MMIO support and see what might have been causing spurious interrupts I saw with the 8259A PIC the board's interrupt line has been routed to. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28defxx: DEFEA's ESIC port I/O decoding cleanupMaciej W. Rozycki
Use the slot-specific I/O range for decoding accesses to PDQ ASIC registers (IOCS0) and the discrete Burst Holdoff register (IOCS1) as per the "HD64981F EISA Slave Interface Controller (ESIC)" datasheet. Use disjoint decode ranges now that the assignment of chip selects is known. Update the span of the port I/O resource requested accordingly. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28defxx: DEFEA's Burst Holdoff register initialization fixMaciej W. Rozycki
Use the mask rather than bit number macro to initialize the chip select control bit for PDQ register space decoding in the Burst Holdoff register. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28defxx: Correct DEFEA's ESIC port I/O accessesMaciej W. Rozycki
Reverse the order of arguments to `outb', data to write comes first. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2014-09-25 1) Remove useless hash_resize_mutex in xfrm_hash_resize(). This mutex is used only there, but xfrm_hash_resize() can't be called concurrently at all. From Ying Xue. 2) Extend policy hashing to prefixed policies based on prefix lenght thresholds. From Christophe Gouault. 3) Make the policy hash table thresholds configurable via netlink. From Christophe Gouault. 4) Remove the maximum authentication length for AH. This was needed to limit stack usage. We switched already to allocate space, so no need to keep the limit. From Herbert Xu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'dsa_eee'David S. Miller
Florian Fainelli says: ==================== net: dsa: EEE and other PM features This patch set allows DSA switch drivers to enable/disable/query EEE on a per-port level, as well as control precisely which switch ports are enable/disabled. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: bcm_sf2: add support for controlling EEEFlorian Fainelli
When EEE is enabled, negotiate this feature with the PHY and make sure that the capability checking, local EEE advertisement, link partner EEE advertisement and auto-negotiation resolution returned by phy_init_eee() is positive, and enable EEE at the switch level. While querying the current EEE settings, verify the low-power indication and indicate its status. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: allow switches driver to implement get/set EEEFlorian Fainelli
Allow switches driver to query and enable/disable EEE on a per-port basis by implementing the ethtool_{get,set}_eee settings and delegating these operations to the switch driver. set_eee() will need to coordinate with the PHY driver to make sure that EEE is enabled, the link-partner supports it and the auto-negotiation result is satisfactory. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: bcm_sf2: add port_enable/disable callbacksFlorian Fainelli
The SF2 switch driver is already architected around per-port enable/disable callbacks, so we just need a slight update to our existing bcm_sf2_port_setup() resp. bcm_sf2_port_disable() functions to be suitable as callbacks for port_enable/port_disable. We need to shuffle a little the code that does the per-port VLAN configuration/isolation since ports can now be brought up/down separately, so we need to make sure that IMP (CPU, management) port is always included in that specific port setup. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: bcm_sf2: disable RGMII interface(s) when link is downFlorian Fainelli
When the link is down, disable the RGMII interface to conserve as much power as possible. We re-enable the RGMII interface whenever the link is detected. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: allow enabling and disable switch portsFlorian Fainelli
Whenever a per-port network device is used/unused, invoke the switch driver port_enable/port_disable callbacks to allow saving as much power as possible by disabling unused parts of the switch (RX/TX logic, memory arrays, PHYs...). We supply a PHY device argument to make sure the switch driver can act on the PHY device if needed (like putting/taking the PHY out of deep low power mode). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: start and stop the PHY state machineFlorian Fainelli
dsa_slave_open() should start the PHY library state machine for its PHY interface, and dsa_slave_close() should stop the PHY library state machine accordingly. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: use tcp_flags in tcp_data_queue()Peter Pan(潘卫平)
This patch is a cleanup which follows the idea in commit e11ecddf5128 (tcp: use TCP_SKB_CB(skb)->tcp_flags in input path), and it may reduce register pressure since skb->cb[] access is fast, bacause skb is probably in a register. v2: remove variable th v3: reword the changelog Signed-off-by: Weiping Pan <panweiping3@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: change tcp_skb_pcount() locationEric Dumazet
Our goal is to access no more than one cache line access per skb in a write or receive queue when doing the various walks. After recent TCP_SKB_CB() reorganizations, it is almost done. Last part is tcp_skb_pcount() which currently uses skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs 3 cache lines in current kernel (skb->head, skb->end, and shinfo->gso_segs are all in 3 different cache lines, far from skb->cb) This very simple patch reuses space currently taken by tcp_tw_isn only in input path, as tcp_skb_pcount is only needed for skb stored in write queue. This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible. This also speeds up all sack processing in general. This speeds up tcp_sendmsg() because it no longer has to access/dirty shinfo. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'tcp_skb_cb'David S. Miller
Eric Dumazet says: ==================== tcp: better TCP_SKB_CB layout TCP had the assumption that IPCB and IP6CB are first members of skb->cb[] This is fine, except that IPCB/IP6CB are used in TCP for a very short time in input path. What really matters for TCP stack is to get skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq in the same cache line. skb that are immediately consumed do not care because whole skb->cb[] is hot in cpu cache, while skb that sit in wocket write queue or receive queues do not need TCP_SKB_CB(skb)->header at all. This patch set implements the prereq for IPv4, IPv6, and TCP to make this possible. This makes TCP more efficient. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: better TCP_SKB_CB layout to reduce cache line missesEric Dumazet
TCP maintains lists of skb in write queue, and in receive queues (in order and out of order queues) Scanning these lists both in input and output path usually requires access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq These fields are currently in two different cache lines, meaning we waste lot of memory bandwidth when these queues are big and flows have either packet drops or packet reorders. We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because this header is not used in fast path. This allows TCP to search much faster in the skb lists. Even with regular flows, we save one cache line miss in fast path. Thanks to Christoph Paasch for noticing we need to cleanup skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path, and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted()Eric Dumazet
ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm that it needs. Lets not assume this, as TCP stack might use a different place. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ipv4: rename ip_options_echo to __ip_options_echo()Eric Dumazet
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt Lets break this assumption, but provide a helper to not change all call points. ip_send_unicast_reply() gets a new struct ip_options pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net : optimize skb_release_data()Eric Dumazet
Cache skb_shinfo(skb) in a variable to avoid computing it multiple times. Reorganize the tests to remove one indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26sparc: bpf_jit: add support for BPF_LD(X) | BPF_LEN instructionsAlexei Starovoitov
BPF_LD | BPF_W | BPF_LEN instruction is occasionally used by tcpdump and present in 11 tests in lib/test_bpf.c Teach sparc JIT compiler to emit it. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: bcmgenet: Fix compile warningTobias Klauser
bcmgenet_wol_resume() is only used in bcmgenet_resume(), which is only defined when CONFIG_PM_SLEEP is enabled. This leads to the following compile warning when building with !CONFIG_PM_SLEEP: drivers/net/ethernet/broadcom/genet/bcmgenet.c:1967:12: warning: ‘bcmgenet_wol_resume’ defined but not used [-Wunused-function] Since bcmgenet_resume() is the only user of bcmgenet_wol_resume(), fix this by directly inlining the function there. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net/openvswitch: remove dup comment in vport.hWang Sheng-Hui
Remove the duplicated comment "/* The following definitions are for users of the vport subsytem: */" in vport.h Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2014-09-23 This patch series adds support for the FM10000 Ethernet switch host interface. The Intel FM10000 Ethernet Switch is a 48-port Ethernet switch supporting both Ethernet ports and PCI Express host interfaces. The fm10k driver provides support for the host interface portion of the switch, both PF and VF. As the host interfaces are directly connected to the switch this results in some significant differences versus a standard network driver. For example there is no PHY or MII on the device. Since packets are delivered directly from the switch to the host interface these are unnecessary. Otherwise most of the functionality is very similar to our other network drivers such as ixgbe or igb. For example we support all the standard network offloads, jumbo frames, SR-IOV (64 VFS), PTP, and some VXLAN and NVGRE offloads. v2: converted dev_consume_skb_any() to dev_kfree_skb_any() fix up PTP code based on feedback from the community v3: converted the use of smb_mb__before_clear_bit() to smb_mb__before_atomic() added vmalloc header to patch 15 added prefetch header to patch 16 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: optimise inet_proto_csum_replace4()LEROY Christophe
csum_partial() is a generic function which is not optimised for small fixed length calculations, and its use requires to store "from" and "to" values in memory while we already have them available in registers. This also has impact, especially on RISC processors. In the same spirit as the change done by Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4() taking into account RFC1624. I spotted during a NATted tcp transfert that csum_partial() is one of top 5 consuming functions (around 8%), and the second user of csum_partial() is inet_proto_csum_replace4(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: optimise csum_replace4()LEROY Christophe
csum_partial() is a generic function which is not optimised for small fixed length calculations, and its use requires to store "from" and "to" values in memory while we already have them available in registers. This also has impact, especially on RISC processors. In the same spirit as the change done by Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4() taking into account RFC1624. I spotted during a NATted tcp transfert that csum_partial() is one of top 5 consuming functions (around 8%), and the second user of csum_partial() is inet_proto_csum_replace4(). I have proposed the same modification to inet_proto_csum_replace4() in another patch. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge branch 'fec'David S. Miller
Fugang Duan says: ==================== net: fec: Code cleanup This patches does several things: - Fixing multiqueue issue. - Removing the unnecessary errata workaround. - Aligning the data buffer dma map/unmap size. - Freeing resource after probe failed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: fec: free resource after phy probe failedNimrod Andy
Free memory and disable all related clocks when there has no phy connection or phy probe failed. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: fec: align rx data buffer size for dma map/unmapNimrod Andy
Align allocated rx data buffer size for dma map/unmap, otherwise kernel print warning when enable DMA_API_DEBUG. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: fec: remove the ERR006358 workaround for imx6sx enetNimrod Andy
Remove the ERR006358 workaround for imx6sx enet since the hw issue was fixed on the SOC. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: fec: Add Ftype to BD to distiguish three tx queues for AVBNimrod Andy
The current driver loss Ftype field init for BD, which cause tx queue #1 and #2 cannot work well. Add Ftype field to BD to distiguish three queues for AVB: 0 -> Best Effort 1 -> ClassA 2 -> ClassB Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: introduce __skb_header_release()Eric Dumazet
While profiling TCP stack, I noticed one useless atomic operation in tcp_sendmsg(), caused by skb_header_release(). It turns out all current skb_header_release() users have a fresh skb, that no other user can see, so we can avoid one atomic operation. Introduce __skb_header_release() to clearly document this. This gave me a 1.5 % improvement on TCP_RR workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26fec: Remove fec_enet_select_queue()Fabio Estevam
Sparse complains about fec_enet_select_queue() not being static. Feedback from David Miller [1] was to remove this function instead of making it static: "Please just delete this function. It's overriding code which does exactly the same thing. Actually, more precisely, this code is duplicating code in a way that bypasses many core facilitites of the networking. For example, this override means that socket based flow steering, XPS, etc. are all not happening on these devices. Without ->ndo_select_queue(), the flow dissector does __netdev_pick_tx which is exactly what you want to happen." [1] http://www.spinics.net/lists/netdev/msg297653.html Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge tag 'master-2014-09-16' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== pull request: wireless-next 2014-09-22 Please pull this batch of updates intended for the 3.18 stream... For the mac80211 bits, Johannes says: "This time, I have some rate minstrel improvements, support for a very small feature from CCX that Steinar reverse-engineered, dynamic ACK timeout support, a number of changes for TDLS, early support for radio resource measurement and many fixes. Also, I'm changing a number of places to clear key memory when it's freed and Intel claims copyright for code they developed." For the bluetooth bits, Johan says: "Here are some more patches intended for 3.18. Most of them are cleanups or fixes for SMP. The only exception is a fix for BR/EDR L2CAP fixed channels which should now work better together with the L2CAP information request procedure." For the iwlwifi bits, Emmanuel says: "I fix here dvm which was broken by my last pull request. Arik continues to work on TDLS and Luca solved a few issues in CT-Kill. Eyal keeps digging into rate scaling code, more to come soon. Besides this, nothing really special here." Beyond that, there are the usual big batches of updates to ath9k, b43, mwifiex, and wil6210 as well as a handful of other bits here and there. Also, rtlwifi gets some btcoexist attention from Larry. Please let me know if there are problems! ==================== Had to adjust the wil6210 code to comply with Joe Perches's recent change in net-next to make the netdev_*() routines return void instead of 'int'. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: Change netdev_<level> logging functions to return voidJoe Perches
No caller or macro uses the return value so make all the functions return void. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26mellanox: Change en_print to return voidJoe Perches
No caller or macro uses the return value so make it void. Signed-off-by: Joe Perches <joe@perches.com> Acked-By: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge branch 'bpf-next'David S. Miller
Alexei Starovoitov says: ==================== eBPF syscall, verifier, testsuite v14 -> v15: - got rid of macros with hidden control flow (suggested by David) replaced macro with explicit goto or return and simplified where possible (affected patches #9 and #10) - rebased, retested v13 -> v14: - small change to 1st patch to ease 'new userspace with old kernel' problem (done similar to perf_copy_attr()) (suggested by Daniel) - the rest unchanged v12 -> v13: - replaced 'foo __user *' pointers with __aligned_u64 (suggested by David) - added __attribute__((aligned(8)) to 'union bpf_attr' to keep constant alignment between patches - updated manpage and syscall wrappers due to __aligned_u64 - rebased, retested on x64 with 32-bit and 64-bit userspace and on i386, build tested on arm32,sparc64 v11 -> v12: - dropped patch 11 and copied few macros to libbpf.h (suggested by Daniel) - replaced 'enum bpf_prog_type' with u32 to be safe in compat (.. Andy) - implemented and tested compat support (not part of this set) (.. Daniel) - changed 'void *log_buf' to 'char *' (.. Daniel) - combined struct bpf_work_struct and bpf_prog_info (.. Daniel) - added better return value explanation to manpage (.. Andy) - added log_buf/log_size explanation to manpage (.. Andy & Daniel) - added a lot more info about prog_type and map_type to manpage (.. Andy) - rebased, tweaked test_stubs Patches 1-4 establish BPF syscall shell for maps and programs. Patches 5-10 add verifier step by step Patch 11 adds test stubs for 'unspec' program type and verifier testsuite from user space Note that patches 1,3,4,7 add commands and attributes to the syscall while being backwards compatible from each other, which should demonstrate how other commands can be added in the future. After this set the programs can be loaded for testing only. They cannot be attached to any events. Though manpage talks about tracing and sockets, it will be a subject of future patches. Please take a look at manpage: BPF(2) Linux Programmer's Manual BPF(2) NAME bpf - perform a command on eBPF map or program SYNOPSIS #include <linux/bpf.h> int bpf(int cmd, union bpf_attr *attr, unsigned int size); DESCRIPTION bpf() syscall is a multiplexor for a range of different operations on eBPF which can be characterized as "universal in-kernel virtual machine". eBPF is similar to original Berkeley Packet Filter (or "classic BPF") used to filter network packets. Both statically analyze the programs before loading them into the kernel to ensure that programs cannot harm the running system. eBPF extends classic BPF in multiple ways including ability to call in- kernel helper functions and access shared data structures like eBPF maps. The programs can be written in a restricted C that is compiled into eBPF bytecode and executed on the eBPF virtual machine or JITed into native instruction set. eBPF Design/Architecture eBPF maps is a generic storage of different types. User process can create multiple maps (with key/value being opaque bytes of data) and access them via file descriptor. In parallel eBPF programs can access maps from inside the kernel. It's up to user process and eBPF program to decide what they store inside maps. eBPF programs are similar to kernel modules. They are loaded by the user process and automatically unloaded when process exits. Each eBPF program is a safe run-to-completion set of instructions. eBPF verifier statically determines that the program terminates and is safe to execute. During verification the program takes a hold of maps that it intends to use, so selected maps cannot be removed until the program is unloaded. The program can be attached to different events. These events can be packets, tracepoint events and other types in the future. A new event triggers execution of the program which may store information about the event in the maps. Beyond storing data the programs may call into in-kernel helper functions which may, for example, dump stack, do trace_printk or other forms of live kernel debugging. The same program can be attached to multiple events. Different programs can access the same map: tracepoint tracepoint tracepoint sk_buff sk_buff event A event B event C on eth0 on eth1 | | | | | | | | | | --> tracing <-- tracing socket socket prog_1 prog_2 prog_3 prog_4 | | | | |--- -----| |-------| map_3 map_1 map_2 Syscall Arguments bpf() syscall operation is determined by cmd which can be one of the following: BPF_MAP_CREATE Create a map with given type and attributes and return map FD BPF_MAP_LOOKUP_ELEM Lookup element by key in a given map and return its value BPF_MAP_UPDATE_ELEM Create or update element (key/value pair) in a given map BPF_MAP_DELETE_ELEM Lookup and delete element by key in a given map BPF_MAP_GET_NEXT_KEY Lookup element by key in a given map and return key of next element BPF_PROG_LOAD Verify and load eBPF program attr is a pointer to a union of type bpf_attr as defined below. size is the size of the union. union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ __u32 map_type; __u32 key_size; /* size of key in bytes */ __u32 value_size; /* size of value in bytes */ __u32 max_entries; /* max number of entries in a map */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ __u32 map_fd; __aligned_u64 key; union { __aligned_u64 value; __aligned_u64 next_key; }; }; struct { /* anonymous struct used by BPF_PROG_LOAD command */ __u32 prog_type; __u32 insn_cnt; __aligned_u64 insns; /* 'const struct bpf_insn *' */ __aligned_u64 license; /* 'const char *' */ __u32 log_level; /* verbosity level of eBPF verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied 'char *' buffer */ }; } __attribute__((aligned(8))); eBPF maps maps is a generic storage of different types for sharing data between kernel and userspace. Any map type has the following attributes: . type . max number of elements . key size in bytes . value size in bytes The following wrapper functions demonstrate how this syscall can be used to access the maps. The functions use the cmd argument to invoke different operations. BPF_MAP_CREATE int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries) { union bpf_attr attr = { .map_type = map_type, .key_size = key_size, .value_size = value_size, .max_entries = max_entries }; return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } bpf() syscall creates a map of map_type type and given attributes key_size, value_size, max_entries. On success it returns process-local file descriptor. On error, -1 is returned and errno is set to EINVAL or EPERM or ENOMEM. The attributes key_size and value_size will be used by verifier during program loading to check that program is calling bpf_map_*_elem() helper functions with correctly initialized key and that program doesn't access map element value beyond specified value_size. For example, when map is created with key_size = 8 and program does: bpf_map_lookup_elem(map_fd, fp - 4) such program will be rejected, since in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects to read 8 bytes from 'key' pointer, but 'fp - 4' starting address will cause out of bounds stack access. Similarly, when map is created with value_size = 1 and program does: value = bpf_map_lookup_elem(...); *(u32 *)value = 1; such program will be rejected, since it accesses value pointer beyond specified 1 byte value_size limit. Currently only hash table map_type is supported: enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, BPF_MAP_TYPE_HASH, }; map_type selects one of the available map implementations in kernel. For all map_types eBPF programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem() helper functions. BPF_MAP_LOOKUP_ELEM int bpf_lookup_elem(int fd, void *key, void *value) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), }; return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); } bpf() syscall looks up an element with given key in a map fd. If element is found it returns zero and stores element's value into value. If element is not found it returns -1 and sets errno to ENOENT. BPF_MAP_UPDATE_ELEM int bpf_update_elem(int fd, void *key, void *value) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .value = ptr_to_u64(value), }; return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); } The call creates or updates element with given key/value in a map fd. On success it returns zero. On error, -1 is returned and errno is set to EINVAL or EPERM or ENOMEM or E2BIG. E2BIG indicates that number of elements in the map reached max_entries limit specified at map creation time. BPF_MAP_DELETE_ELEM int bpf_delete_elem(int fd, void *key) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), }; return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); } The call deletes an element in a map fd with given key. Returns zero on success. If element is not found it returns -1 and sets errno to ENOENT. BPF_MAP_GET_NEXT_KEY int bpf_get_next_key(int fd, void *key, void *next_key) { union bpf_attr attr = { .map_fd = fd, .key = ptr_to_u64(key), .next_key = ptr_to_u64(next_key), }; return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); } The call looks up an element by key in a given map fd and returns key of the next element into next_key pointer. If key is not found, it return zero and returns key of the first element into next_key. If key is the last element, it returns -1 and sets errno to ENOENT. Other possible errno values are ENOMEM, EFAULT, EPERM, EINVAL. This method can be used to iterate over all elements of the map. close(map_fd) will delete the map map_fd. Exiting process will delete all maps automatically. eBPF programs BPF_PROG_LOAD This cmd is used to load eBPF program into the kernel. char bpf_log_buf[LOG_BUF_SIZE]; int bpf_prog_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, int insn_cnt, const char *license) { union bpf_attr attr = { .prog_type = prog_type, .insns = ptr_to_u64(insns), .insn_cnt = insn_cnt, .license = ptr_to_u64(license), .log_buf = ptr_to_u64(bpf_log_buf), .log_size = LOG_BUF_SIZE, .log_level = 1, }; return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); } prog_type is one of the available program types: enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, BPF_PROG_TYPE_SOCKET, BPF_PROG_TYPE_TRACING, }; By picking prog_type program author selects a set of helper functions callable from eBPF program and corresponding format of struct bpf_context (which is the data blob passed into the program as the first argument). For example, the programs loaded with prog_type = TYPE_TRACING may call bpf_printk() helper, whereas TYPE_SOCKET programs may not. The set of functions available to the programs under given type may increase in the future. Currently the set of functions for TYPE_TRACING is: bpf_map_lookup_elem(map_fd, void *key) // lookup key in a map_fd bpf_map_update_elem(map_fd, void *key, void *value) // update key/value bpf_map_delete_elem(map_fd, void *key) // delete key in a map_fd bpf_ktime_get_ns(void) // returns current ktime bpf_printk(char *fmt, int fmt_size, ...) // prints into trace buffer bpf_memcmp(void *ptr1, void *ptr2, int size) // non-faulting memcmp bpf_fetch_ptr(void *ptr) // non-faulting load pointer from any address bpf_fetch_u8(void *ptr) // non-faulting 1 byte load bpf_fetch_u16(void *ptr) // other non-faulting loads bpf_fetch_u32(void *ptr) bpf_fetch_u64(void *ptr) and bpf_context is defined as: struct bpf_context { /* argN fields match one to one to arguments passed to trace events */ u64 arg1, arg2, arg3, arg4, arg5, arg6; /* return value from kretprobe event or from syscall_exit event */ u64 ret; }; The set of helper functions for TYPE_SOCKET is TBD. More program types may be added in the future. Like BPF_PROG_TYPE_USER_TRACING for unprivileged programs. BPF_PROG_TYPE_UNSPEC is used for testing only. Such programs cannot be attached to events. insns array of "struct bpf_insn" instructions insn_cnt number of instructions in the program license license string, which must be GPL compatible to call helper functions marked gpl_only log_buf user supplied buffer that in-kernel verifier is using to store verification log. Log is a multi-line string that should be used by program author to understand how verifier came to conclusion that program is unsafe. The format of the output can change at any time as verifier evolves. log_size size of user buffer. If size of the buffer is not large enough to store all verifier messages, -1 is returned and errno is set to ENOSPC. log_level verbosity level of eBPF verifier, where zero means no logs provided close(prog_fd) will unload eBPF program The maps are accesible from programs and generally tie the two together. Programs process various events (like tracepoint, kprobe, packets) and store the data into maps. User space fetches data from maps. Either the same or a different map may be used by user space as configuration space to alter program behavior on the fly. Events Once an eBPF program is loaded, it can be attached to an event. Various kernel subsystems have different ways to do so. For example: setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)); will attach the program prog_fd to socket sock which was received by prior call to socket(). ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); will attach the program prog_fd to perf event event_fd which was received by prior call to perf_event_open(). Another way to attach the program to a tracing event is: event_fd = open("/sys/kernel/debug/tracing/events/skb/kfree_skb/filter"); write(event_fd, "bpf-123"); /* where 123 is eBPF program FD */ /* here program is attached and will be triggered by events */ close(event_fd); /* to detach from event */ EXAMPLES /* eBPF+sockets example: * 1. create map with maximum of 2 elements * 2. set map[6] = 0 and map[17] = 0 * 3. load eBPF program that counts number of TCP and UDP packets received * via map[skb->ip->proto]++ * 4. attach prog_fd to raw socket via setsockopt() * 5. print number of received TCP/UDP packets every second */ int main(int ac, char **av) { int sock, map_fd, prog_fd, key; long long value = 0, tcp_cnt, udp_cnt; map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2); if (map_fd < 0) { printf("failed to create map '%s'\n", strerror(errno)); /* likely not run as root */ return 1; } key = 6; /* ip->proto == tcp */ assert(bpf_update_elem(map_fd, &key, &value) == 0); key = 17; /* ip->proto == udp */ assert(bpf_update_elem(map_fd, &key, &value) == 0); struct bpf_insn prog[] = { BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */ BPF_LD_ABS(BPF_B, 14 + 9), /* r0 = ip->proto */ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),/* *(u32 *)(fp - 4) = r0 */ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */ BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */ BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), /* r0 = map_lookup(r1, r2) */ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), /* if (r0 == 0) goto pc+2 */ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *)r0 += r1 */ BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ BPF_EXIT_INSN(), /* return r0 */ }; prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET, prog, sizeof(prog), "GPL"); assert(prog_fd >= 0); sock = open_raw_sock("lo"); assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0); for (;;) { key = 6; assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0); key = 17; assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0); printf("TCP %lld UDP %lld packets0, tcp_cnt, udp_cnt); sleep(1); } return 0; } RETURN VALUE For a successful call, the return value depends on the operation: BPF_MAP_CREATE The new file descriptor associated with eBPF map. BPF_PROG_LOAD The new file descriptor associated with eBPF program. All other commands Zero. On error, -1 is returned, and errno is set appropriately. ERRORS EPERM bpf() syscall was made without sufficient privilege (without the CAP_SYS_ADMIN capability). ENOMEM Cannot allocate sufficient memory. EBADF fd is not an open file descriptor EFAULT One of the pointers ( key or value or log_buf or insns ) is outside accessible address space. EINVAL The value specified in cmd is not recognized by this kernel. EINVAL For BPF_MAP_CREATE, either map_type or attributes are invalid. EINVAL For BPF_MAP_*_ELEM commands, some of the fields of "union bpf_attr" unused by this command are not set to zero. EINVAL For BPF_PROG_LOAD, attempt to load invalid program (unrecognized instruction or uses reserved fields or jumps out of range or loop detected or calls unknown function). EACCES For BPF_PROG_LOAD, though program has valid instructions, it was rejected, since it was deemed unsafe (may access disallowed memory region or uninitialized stack/register or function constraints don't match actual types or misaligned access). In such case it is recommended to call bpf() again with log_level = 1 and examine log_buf for specific reason provided by verifier. ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that element with given key was not found. E2BIG program is too large or a map reached max_entries limit (max number of elements). NOTES These commands may be used only by a privileged process (one having the CAP_SYS_ADMIN capability). SEE ALSO eBPF architecture and instruction set is explained in Documentation/networking/filter.txt Linux 2014-09-16 BPF(2) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26bpf: mini eBPF library, test stubs and verifier testsuiteAlexei Starovoitov
1. the library includes a trivial set of BPF syscall wrappers: int bpf_create_map(int key_size, int value_size, int max_entries); int bpf_update_elem(int fd, void *key, void *value); int bpf_lookup_elem(int fd, void *key, void *value); int bpf_delete_elem(int fd, void *key); int bpf_get_next_key(int fd, void *key, void *next_key); int bpf_prog_load(enum bpf_prog_type prog_type, const struct sock_filter_int *insns, int insn_len, const char *license); bpf_prog_load() stores verifier log into global bpf_log_buf[] array and BPF_*() macros to build instructions 2. test stubs configure eBPF infra with 'unspec' map and program types. These are fake types used by user space testsuite only. 3. verifier tests valid and invalid programs and expects predefined error log messages from kernel. 40 tests so far. $ sudo ./test_verifier #0 add+sub+mul OK #1 unreachable OK #2 unreachable2 OK #3 out of range jump OK #4 out of range jump2 OK #5 test1 ld_imm64 OK ... Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26bpf: verifier (add verifier core)Alexei Starovoitov
This patch adds verifier core which simulates execution of every insn and records the state of registers and program stack. Every branch instruction seen during simulation is pushed into state stack. When verifier reaches BPF_EXIT, it pops the state from the stack and continues until it reaches BPF_EXIT again. For program: 1: bpf_mov r1, xxx 2: if (r1 == 0) goto 5 3: bpf_mov r0, 1 4: goto 6 5: bpf_mov r0, 2 6: bpf_exit The verifier will walk insns: 1, 2, 3, 4, 6 then it will pop the state recorded at insn#2 and will continue: 5, 6 This way it walks all possible paths through the program and checks all possible values of registers. While doing so, it checks for: - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention - instruction encoding is not using reserved fields Kernel subsystem configures the verifier with two callbacks: - bool (*is_valid_access)(int off, int size, enum bpf_access_type type); that provides information to the verifer which fields of 'ctx' are accessible (remember 'ctx' is the first argument to eBPF program) - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); returns argument constraints of kernel helper functions that eBPF program may call, so that verifier can checks that R1-R5 types match the prototype More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26bpf: verifier (add branch/goto checks)Alexei Starovoitov
check that control flow graph of eBPF program is a directed acyclic graph check_cfg() does: - detect loops - detect unreachable instructions - check that program terminates with BPF_EXIT insn - check that all branches are within program boundary Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>