path: root/include
Age  Commit message  Author
2016-03-19  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  Linus Torvalds
Pull networking updates from David Miller:
 "Highlights:

   1) Support more Realtek wireless chips, from Jes Sorensen.

   2) New BPF types for per-cpu hash and array maps, from Alexei
      Starovoitov.

   3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

   4) Allow the use of SO_REUSEPORT in order to do per-thread
      processing of incoming TCP/UDP connections. The muxing can be
      done using a BPF program which hashes the incoming packet. From
      Craig Gallek.

   5) Add a multiplexer for TCP streams, to provide a message-based
      interface. BPF programs can be used to determine the message
      boundaries. From Tom Herbert.

   6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

   7) Avoid factorial complexity when taking down an inetdev interface
      with lots of configured addresses. We were doing things like
      traversing the entire address list for each address removed, and
      flushing the entire netfilter conntrack table for every address
      as well.

   8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

   9) Allow offloading u32 classifiers to hardware, and implement for
      ixgbe, from John Fastabend.

  10) Allow configuring IRQ coalescing parameters on a per-queue
      basis, from Kan Liang.

  11) Extend ethtool so that larger link mode masks can be supported.
      From David Decotigny.

  12) Introduce devlink, which can be used to configure port link
      types (ethernet vs Infiniband, etc.), port splitting, and switch
      device level attributes as a whole. From Jiri Pirko.

  13) Hardware offload support for flower classifiers, from Amir
      Vadai.

  14) Add "Local Checksum Offload". Basically, for a tunneled packet
      the checksum of the outer header is 'constant' (because with the
      checksum field filled into the inner protocol header, the
      payload of the outer frame checksums to 'zero'), and we can take
      advantage of that in various ways. From Edward Cree"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
  bonding: fix bond_get_stats()
  net: bcmgenet: fix dma api length mismatch
  net/mlx4_core: Fix backward compatibility on VFs
  phy: mdio-thunder: Fix some Kconfig typos
  lan78xx: add ndo_get_stats64
  lan78xx: handle statistics counter rollover
  RDS: TCP: Remove unused constant
  RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
  net: smc911x: convert pxa dma to dmaengine
  team: remove duplicate set of flag IFF_MULTICAST
  bonding: remove duplicate set of flag IFF_MULTICAST
  net: fix a comment typo
  ethernet: micrel: fix some error codes
  ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
  bpf, dst: add and use dst_tclassid helper
  bpf: make skb->tc_classid also readable
  net: mvneta: bm: clarify dependencies
  cls_bpf: reset class and reuse major in da
  ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
  ldmvsw: Add ldmvsw.c driver code
  ...
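The "Local Checksum Offload" highlight in the net-next pull rests on a property of the Internet (ones'-complement) checksum: once a block carries its own correct checksum, the whole block sums to ones'-complement zero (0xffff), whatever the data was. The following is a minimal userspace sketch of that property only, not kernel code. The names `csum_fold_sum` and `fill_checksum` are hypothetical, and the sketch assumes big-endian 16-bit words and ignores pseudo-headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement sum of big-endian 16-bit words, folded to 16 bits. */
static uint16_t csum_fold_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)			/* pad a trailing odd byte with zero */
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)		/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Write the checksum of buf into the 16-bit field at csum_off.
 * Assumption of this sketch: csum_off is even and inside the buffer.
 */
static void fill_checksum(uint8_t *buf, size_t len, size_t csum_off)
{
	uint16_t csum;

	assert(csum_off % 2 == 0 && csum_off + 1 < len);
	buf[csum_off] = 0;
	buf[csum_off + 1] = 0;
	csum = (uint16_t)~csum_fold_sum(buf, len);
	buf[csum_off] = (uint8_t)(csum >> 8);
	buf[csum_off + 1] = (uint8_t)(csum & 0xff);
}
```

After `fill_checksum()` runs on an inner packet, `csum_fold_sum()` over that packet is 0xffff regardless of its contents, which is why an outer checksum covering it does not depend on the inner payload.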
2016-03-19  Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  Linus Torvalds
Pull cgroup updates from Tejun Heo:
 "cgroup changes for v4.6-rc1. No userland visible behavior changes in
  this pull request. I'll send out a separate pull request for the
  addition of cgroup namespace support.

   - The biggest change is the revamping of cgroup core task migration
     and controller handling logic. There are quite a few places where
     controllers and tasks are manipulated. Previously, many of those
     places implemented custom operations for each specific use case
     assuming specific starting conditions. While this worked, it makes
     the code fragile and difficult to follow.

     The bulk of this pull request restructures these operations so
     that most related operations are performed through common helpers
     which implement recursive (subtrees are always processed
     consistently) and idempotent (they make cgroup hierarchy converge
     to the target state rather than performing operations assuming
     specific starting conditions). This makes the code a lot easier to
     understand, verify and extend.

   - Implicit controller support is added. This is primarily for using
     perf_event on the v2 hierarchy so that perf can match cgroup v2
     path without requiring the user to do anything special. The kernel
     portion of perf_event changes is acked but userland changes are
     still pending review.

   - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
     certain environments.

   - There is a regression introduced during v4.4 devel cycle where
     attempts to migrate zombie tasks can mess up internal object
     management. This was fixed earlier this week and included in this
     pull request w/ stable cc'd.

   - Misc non-critical fixes and improvements"

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
  cgroup: avoid false positive gcc-6 warning
  cgroup: ignore css_sets associated with dead cgroups during migration
  Documentation: cgroup v2: Trivial heading correction.
  cgroup: implement cgroup_subsys->implicit_on_dfl
  cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
  cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
  cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
  cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
  cgroup: Trivial correction to reflect controller.
  cgroup: remove stale item in cgroup-v1 document INDEX file.
  cgroup: update css iteration in cgroup_update_dfl_csses()
  cgroup: allocate 2x cgrp_cset_links when setting up a new root
  cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
  cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
  cgroup: use cgroup_apply_enable_control() in cgroup creation path
  cgroup: combine cgroup_mutex locking and offline css draining
  cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
  cgroup: introduce cgroup_{save|propagate|restore}_control()
  cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
  cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
  ...
2016-03-19  bonding: fix bond_get_stats()  Eric Dumazet
bond_get_stats() can be called from rtnetlink (with RTNL held)
or from /proc/net/dev seq handler (with RCU held)

The logic added in commit 5f0c5f73e5ef ("bonding: make global bonding
stats more reliable") kind of assumed only one cpu could run there.

If multiple threads are reading /proc/net/dev, stats can be really
messed up after a while.

A second problem is that some fields are 32bit, so we need to properly
handle the wrap around problem.

Given that RTNL is not always held, we need to use
bond_for_each_slave_rcu().

Fixes: 5f0c5f73e5ef ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
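The 32-bit wrap-around problem mentioned in the bonding fix can be illustrated in userspace. This is a sketch of one common approach under stated assumptions, not the code from the patch: `fold_stats` is a hypothetical name, and the rule that a field is treated as a 32-bit counter when both samples fit in 32 bits is a heuristic chosen for this example. The point is that subtracting in `uint32_t` yields the correct delta even when the counter wrapped between samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Accumulate (new - old) for each of n counters into res.
 * Heuristic (assumption of this sketch): if neither sample uses the
 * upper 32 bits, the field is treated as a 32-bit counter and the
 * delta is computed modulo 2^32, so one wrap between samples is
 * handled correctly.
 */
static void fold_stats(uint64_t *res, const uint64_t *new_s,
		       const uint64_t *old_s, size_t n)
{
	size_t i;

	assert(res != NULL && new_s != NULL && old_s != NULL);
	for (i = 0; i < n; i++) {
		uint64_t nv = new_s[i];
		uint64_t ov = old_s[i];

		if (((nv | ov) >> 32) == 0)
			res[i] += (uint32_t)((uint32_t)nv - (uint32_t)ov);
		else
			res[i] += nv - ov;
	}
}
```

With an old sample of 0xfffffff0 and a new sample of 0x10, a plain 64-bit subtraction would add a huge bogus value to the total, while the 32-bit subtraction adds 0x20.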
2016-03-19  Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata  Linus Torvalds
Pull libata updates from Tejun Heo:

 - ahci grew runtime power management support so that the controller
   can be turned off if no devices are attached.

 - sata_via isn't dead yet. It got hotplug support and more refined
   workaround for certain WD drives.

 - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
   conflicts in ahci PCI ID table.

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ata: ahci_xgene: dereferencing uninitialized pointer in probe
  AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
  ata: sata_rcar: Use ARCH_RENESAS
  sata_via: Implement hotplug for VT6421
  sata_via: Apply WD workaround only when needed on VT6421
  ahci: Add runtime PM support for the host controller
  ahci: Add functions to manage runtime PM of AHCI ports
  ahci: Convert driver to use modern PM hooks
  ahci: Cache host controller version
  scsi: Drop runtime PM usage count after host is added
  scsi: Set request queue runtime PM status back to active on resume
  block: Add blk_set_runtime_active()
  ata: ahci_mvebu: add support for Armada 3700 variant
  libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
  libata: support AHCI on OCTEON platform
2016-03-19  Merge branch 'akpm' (patches from Andrew)  Linus Torvalds
Merge second patch-bomb from Andrew Morton:

 - a couple of hotfixes

 - the rest of MM

 - a new timer slack control in procfs

 - a couple of procfs fixes

 - a few misc things

 - some printk tweaks

 - lib/ updates, notably to radix-tree.

 - add my and Nick Piggin's old userspace radix-tree test harness to
   tools/testing/radix-tree/. Matthew said it was a godsend during the
   radix-tree work he did.

 - a few code-size improvements, switching to __always_inline where gcc
   screwed up.

 - partially implement character sets in sscanf

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  sscanf: implement basic character sets
  lib/bug.c: use common WARN helper
  param: convert some "on"/"off" users to strtobool
  lib: add "on"/"off" support to kstrtobool
  lib: update single-char callers of strtobool()
  lib: move strtobool() to kstrtobool()
  include/linux/unaligned: force inlining of byteswap operations
  include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
  include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
  usb: common: convert to use match_string() helper
  ide: hpt366: convert to use match_string() helper
  ata: hpt366: convert to use match_string() helper
  power: ab8500: convert to use match_string() helper
  power: charger_manager: convert to use match_string() helper
  drm/edid: convert to use match_string() helper
  pinctrl: convert to use match_string() helper
  device property: convert to use match_string() helper
  lib/string: introduce match_string() helper
  radix-tree tests: add test for radix_tree_iter_next
  radix-tree tests: add regression3 test
  ...
2016-03-19  Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block  Linus Torvalds
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for this merge window. It sits
  on top of for-4.6/core, that was just sent out.

  This contains:

   - A set of fixes for lightnvm. One from Alan, fixing an overflow,
     and the rest from the usual suspects, Javier and Matias.

   - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd
     for correct usage of the signed 64-bit divider.

   - A set of bug fixes for the Micron mtip32xx, from Asai.

   - A fix for the brd discard handling from Bart.

   - Update the maintainers entry for cciss, since that hardware has
     transferred ownership.

   - Three bug fixes for bcache from Eric Wheeler.

   - Set of fixes for xen-blk{back,front} from Jan and Konrad.

   - Removal of the cpqarray driver. It has been disabled in Kconfig
     since 2013, and we were initially scheduled to remove it in 3.15.

   - Various updates and fixes for NVMe, with the most important being:

       - Removal of the per-device NVMe thread, replacing that with a
         watchdog timer instead. From Christoph.

       - Exposing the namespace WWID through sysfs, from Keith.

       - Set of cleanups from Ming Lin.

       - Logging the controller device name instead of the underlying
         PCI device name, from Sagi.

       - And a bunch of fixes and optimizations from the usual suspects
         in this area"

* 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits)
  NVMe: Expose ns wwid through single sysfs entry
  drivers:block: cpqarray clean up
  brd: Fix discard request processing
  cpqarray: remove it from the kernel
  cciss: update MAINTAINERS
  NVMe: Remove unused sq_head read in completion path
  bcache: fix cache_set_flush() NULL pointer dereference on OOM
  bcache: cleaned up error handling around register_cache()
  bcache: fix race of writeback thread starting before complete initialization
  NVMe: Create discard zero quirk white list
  nbd: use correct div_s64 helper
  mtip32xx: remove unneeded variable in mtip_cmd_timeout()
  lightnvm: generalize rrpc ppa calculations
  lightnvm: remove struct nvm_dev->total_blocks
  lightnvm: rename ->nr_pages to ->nr_sects
  lightnvm: update closed list outside of intr context
  xen/blback: Fit the important information of the thread in 17 characters
  lightnvm: fold get bb tbl when using dual/quad plane mode
  lightnvm: fix up nonsensical configure overrun checking
  xen-blkback: advertise indirect segment support earlier
  ...
2016-03-18  Merge branch 'for-4.6/core' of git://git.kernel.dk/linux-block  Linus Torvalds
Pull core block updates from Jens Axboe:
 "Here are the core block changes for this merge window. Not a lot of
  exciting stuff going on in this round, most of the changes have been
  on the driver side of things. That pull request is coming next.

  This pull request contains:

   - A set of fixes for chained bio handling from Christoph.

   - A tag bounds check for blk-mq from Hannes, ensuring that we don't
     do something stupid if a device reports an invalid tag value.

   - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

   - A set of blk-mq fixes from Keith, adding support for dynamic
     hardware queues, and fixing init of max_dev_sectors for stacking
     devices.

   - A fix for the dynamic hw context from Ming.

   - Enabling of cgroup writeback support on a block device, from
     Shaohua"

* 'for-4.6/core' of git://git.kernel.dk/linux-block:
  blk-mq: add bounds check on tag-to-rq conversion
  block: bio_remaining_done() isn't unlikely
  block: cleanup bio_endio
  block: factor out chained bio completion
  block: don't unecessarily clobber bi_error for chained bios
  block-dev: enable writeback cgroup support
  blk-mq: Fix NULL pointer updating nr_requests
  blk-mq: mark request queue as mq asap
  block: Initialize max_dev_sectors to 0
  blk-mq: dynamic h/w context count
  cfq-iosched: Allow parent cgroup to preempt its child
  cfq-iosched: Allow sync noidle workloads to preempt each other
  cfq-iosched: Reorder checks in cfq_should_preempt()
  cfq-iosched: Don't group_idle if cfqq has big thinktime
2016-03-18  net: fix a comment typo  Zhang Shengju
Fix a comment typo.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18  ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it  Daniel Borkmann
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18  bpf, dst: add and use dst_tclassid helper  Daniel Borkmann
We can just add a small helper dst_tclassid() for retrieving the
dst->tclassid value. It makes the code a bit better in that we can get
rid of the ifdef from filter.c by moving this into the header.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
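The pattern described for dst_tclassid(), hiding a config-dependent field behind a header helper so callers need no ifdef, can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `demo_dst`, `demo_skb`, `demo_dst_tclassid` and the `DEMO_ROUTE_CLASSID` macro are illustration names and do not come from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Remove this line to model a build without the optional feature. */
#define DEMO_ROUTE_CLASSID 1

/* Hypothetical stand-ins for the kernel's dst entry and socket buffer. */
struct demo_dst {
#ifdef DEMO_ROUTE_CLASSID
	uint32_t tclassid;	/* field exists only when the feature is on */
#endif
	int unused;
};

struct demo_skb {
	const struct demo_dst *dst;
};

/*
 * Header-style helper: the ifdef lives here once, so every caller can
 * use the helper unconditionally and gets 0 when the feature is off
 * or when no dst is attached.
 */
static inline uint32_t demo_dst_tclassid(const struct demo_skb *skb)
{
	assert(skb != NULL);
#ifdef DEMO_ROUTE_CLASSID
	if (skb->dst)
		return skb->dst->tclassid;
#endif
	return 0;
}
```

A caller simply writes `demo_dst_tclassid(skb)`; whether the field exists in a given build is decided in one place.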
2016-03-18  Merge tag 'mfd-for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd  Linus Torvalds
Pull MFD updates from Lee Jones:
 "New Drivers:
   - Freescale Touch Screen ADC
   - X-Powers AXP PMIC with RSB
   - TI TPS65086 Power Management IC (PMIC)

  New Device Support:
   - Supply device PCI IDs for Intel Broxton

  Fix-ups:
   - Move to clkdev_create() API; intel_quark_i2c_gpio
   - Complete re-write of TI's TPS65912 Power Management IC (PMIC)
   - Remove unnecessary function argument; axp20x
   - Separate out bus related code; axp20x
   - Coding Style changes; axp20x
   - Allow more drivers to be compiled as modules
   - Work around false positive 'used uninitialised' warning; db8500-prcmu

  Bug Fixes:
   - Remove do_div(); fsl-imx25-gcq
   - Fix driver init when built-in; tps65010
   - Fix clock-unregister leak; intel-lpss"

* tag 'mfd-for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (53 commits)
  mfd: intel-lpss: Pass I2C configuration via properties on BXT
  mfd: imx6sx: Add PCIe register definitions for iomuxc gpr
  mfd: ipaq-micro: Use __maybe_unused to hide pm functions
  mfd: max77686: Add max77802 to I2C device ID table
  mfd: max77686: Export OF module alias information
  mfd: max77686: Allow driver to be built as a module
  mfd: stmpe: Add the proper PWM resources
  mfd: tps65090: Set regmap config reg counts properly
  mfd: syscon: Return ENOTSUPP instead of ENOSYS when disabled
  mfd: as3711: Set regmap config reg counts properly
  mfd: rc5t583: Set regmap config reg counts properly
  gpio: tps65086: Add GPO driver for the TPS65086 PMIC
  mfd: mt6397: Add platform device ID table
  mfd: da9063: Fix missing volatile registers in the core regmap_range volatile lists
  mfd: mt6397: Add MT6323 support to MT6397 driver
  mfd: mt6397: Add support for different Slave types
  mfd: mt6397: int_con and int_status may vary in location
  dt-bindings: mfd: Add bindings for the MediaTek MT6323 PMIC
  mfd: da9062: Fix missing volatile registers in the core regmap_range volatile lists
  mfd: Add documentation for ACT8945A DT bindings
  ...
2016-03-18  Merge tag 'sound-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound  Linus Torvalds
Pull sound updates from Takashi Iwai:
 "After a heavy storm by syzkaller in 4.5 cycle, we have relatively few
  changes in the core at this time while a lot of changes are found in
  the driver side, unsurprisingly. Below are some highlights:

  ALSA core:
   - A few more hardening in ALSA timer codes
   - An extension of sequencer API for advertising the card / pid
   - Small fixes in compress-offload and jack layers

  HD-audio:
   - Dynamic PCM assignment in HDMI/DP codec; preparation for upcoming
     DP-MST support
   - Lots of code refactoring for sharing with ASoC SKL driver
   - Regression fixes for Intel HDMI/DP
   - Fixups for CX20724 codec, Lenovo AiO

  USB-audio:
   - Add quirk_alias option to make quirk debugging easier
   - Fixes for possible Oops by malformed firmware

  Firewire:
   - Add support for FW-1804 in tascam driver
   - Improvements / changes in card registration, multi stream
     handling, etc for DICE
   - Lots of code refactoring

  ASoC:
   - Enhancements of still ongoing topology API
   - Lots of commits for Intel Skylake support including HDMI support
   - A few Intel Atom driver updates for recent devices
   - Lots of improvements to the Renesas drivers
   - Capture support for Qualcomm drivers
   - Support for TI DaVinci DRA7xxx devices
   - New machine drivers for Freescale systems with Cirrus CODECs,
     Mediatek systems with RT5650 CODECs
   - New CPU drivers for Allwinner S/PDIF controllers
   - New CODEC drivers for Maxim MAX9867 and MAX98926 and Realtek
     RT5514"

* tag 'sound-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (291 commits)
  ALSA: hda - Fix mutex deadlock at HDMI/DP hotplug
  ALSA: ctl: change return value in compatibility layer so that it's the same value in core implementation
  ALSA: mixart: silence an uninitialized variable warning
  ALSA: usb-audio: Add sanity checks for endpoint accesses
  ALSA: usb-audio: Minor code cleanup in create_fixed_stream_quirk()
  ALSA: usb-audio: Fix NULL dereference in create_fixed_stream_quirk()
  ALSA: hda - Limit i915 HDMI binding only for HSW and later
  ALSA: hda - Fix unconditional GPIO toggle via automute
  ALSA: mixart: silence unitialized variable warnings
  ALSA: hda - Fixes double fault in nvhdmi_chmap_cea_alloc_validate_get_type
  ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41.
  ALSA: hda - Add new GPU codec ID 0x10de0082 to snd-hda
  ASoC: rsnd: add simplified module explanation
  ASoC: hdac_hdmi: Add broxton device ID
  ASoC: Intel: Bxtn: Add Broxton PCI ID
  ASoC: Intel: Skylake: Move Skylake dsp ops & loader ops
  ASoC: Intel: add dmabuffer to common sst_dsp
  ASoC: Intel: Skylake: Unstatify skl_dsp_enable_core
  ASoC: Intel: Skylake: Fix whitepsace issues
  ASoC: Intel: Skylake: Move module id defines
  ...
2016-03-18  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma  Linus Torvalds
Pull rdma updates from Doug Ledford:
 "Initial roundup of 4.6 merge window patches.

  This is the first of two pull requests. It is the smaller request,
  but touches far more different things (this is everything but what
  is in or going into staging). The pull request for the code in
  staging/rdma is on hold until after we decide what to do on the
  write/writev API issue and may be partially deferred until 4.7 as a
  result.

  Summary:

   - cxgb4 updates
   - nes updates
   - unification of iwarp portmapper code to core
   - add drain_cq API
   - various ib_core updates
   - minor ipoib updates
   - minor mlx4 updates
   - more significant mlx5 updates (including a minor merge conflict
     with net-next tree...merge is simple to resolve and Stephen's
     resolution was confirmed by Mellanox)
   - trivial net/9p rdma conversion
   - ocrdma RoCEv2 update
   - srpt updates"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (85 commits)
  iwpm: crash fix for large connections test
  iw_cxgb3: support for iWARP port mapping
  iw_cxgb4: remove port mapper related code
  iw_nes: remove port mapper related code
  iwcm: common code for port mapper
  net/9p: convert to new CQ API
  IB/mlx5: Add support for don't trap rules
  net/mlx5_core: Introduce forward to next priority action
  net/mlx5_core: Create anchor of last flow table
  iser: Accept arbitrary sg lists mapping if the device supports it
  mlx5: Add arbitrary sg list support
  IB/core: Add arbitrary sg_list support
  IB/mlx5: Expose correct max_fast_reg_page_list_len
  IB/mlx5: Make coding style more consistent
  IB/mlx5: Convert UMR CQ to new CQ API
  IB/ocrdma: Skip using unneeded intermediate variable
  IB/ocrdma: Skip using unneeded intermediate variable
  IB/ocrdma: Delete unnecessary variable initialisations in 11 functions
  IB/core: Documentation fix in the MAD header file
  IB/core: trivial prink cleanup.
  ...
2016-03-18  Merge tag 'staging-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging  Linus Torvalds
Pull staging driver updates from Greg KH:
 "Here is the big staging driver pull request for 4.6-rc1.

  Lots of little things here, over 1600 patches or so. Notable is all
  of the good Lustre work happening, those developers have finally
  woken up and are cleaning up their code greatly. The Outreachy intern
  application process is also happening, which brought in another 400
  or so patches. Full details are in the very long shortlog.

  All of these have been in linux-next with no reported issues"

* tag 'staging-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1673 commits)
  staging: lustre: fix aligments in lnet selftest
  staging: lustre: report minimum of two buffers for LNet selftest load test
  staging: lustre: test for proper errno code in lstcon_rpc_trans_abort
  staging: lustre: filter remaining extra spacing for lnet selftest
  staging: lustre: remove extra spacing when setting variable for lnet selftest
  staging: lustre: remove extra spacing of variable declartions for lnet selftest
  staging: lustre: fix spacing issues checkpatch reported in lnet selftest
  staging: lustre: remove returns in void function for lnet selftest
  staging: lustre: fix bogus lst errors for lnet selftest
  staging: netlogic: Replacing pr_err with dev_err after the call to devm_kzalloc
  staging: mt29f_spinand: Replacing pr_info with dev_info after the call to devm_kzalloc
  staging: android: ion: fix up file mode
  staging: ion: debugfs invalid gfp mask
  staging: rts5208: Replace pci_enable_device with pcim_enable_device
  Staging: ieee80211: Place constant on right side of the test.
  staging: speakup: Replace del_timer with del_timer_sync
  staging: lowmemorykiller: fix 2 checks that checkpatch complained
  staging: mt29f_spinand: Drop void pointer cast
  staging: rdma: hfi1: file_ops: Replace ALIGN with PAGE_ALIGN
  staging: rdma: hfi1: driver: Replace IS_ALIGNED with PAGE_ALIGNED
  ...
2016-03-18  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input  Linus Torvalds
Pull input updates from Dmitry Torokhov:
 "The most notable item is addition of support for Synaptics RMI4
  protocol which is native protocol for all current Synaptics devices
  (touchscreens, touchpads). In later releases we'll switch devices
  using HID and PS/2 protocol emulation to RMI4.

  You will also get:

   - BYD PS/2 touchpad protocol support for psmouse

   - MELFAS MIP4 Touchscreen driver

   - rotary encoder was moved away from legacy platform data and to
     generic device properties API, devm_* API, and can now handle
     encoders using more than 2 GPIOs

   - Cypress touchpad driver was switched to devm_* API and device
     properties

   - other assorted driver fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
  ARM: pxa/raumfeld: use PROPERTY_ENTRY_INTEGER to define props
  Input: synaptics-rmi4 - using logical instead of bitwise AND
  Input: powermate - fix oops with malicious USB descriptors
  Input: snvs_pwrkey - fix returned value check of syscon_regmap_lookup_by_phandle()
  MAINTAINERS: add devicetree bindings to Input Drivers section
  Input: synaptics-rmi4 - add device tree support to the SPI transport driver
  Input: synaptics-rmi4 - add SPI transport driver
  Input: synaptics-rmi4 - add support for F30
  Input: synaptics-rmi4 - add support for F12
  Input: synaptics-rmi4 - add device tree support for 2d sensors and F11
  Input: synaptics-rmi4 - add support for 2D sensors and F11
  Input: synaptics-rmi4 - add device tree support for RMI4 I2C devices
  Input: synaptics-rmi4 - add I2C transport driver
  Input: synaptics-rmi4 - add support for Synaptics RMI4 devices
  Input: ad7879 - add device tree support
  Input: ad7879 - fix default x/y axis assignment
  Input: ad7879 - move header to platform_data directory
  Input: ts4800 - add hardware dependency
  Input: cyapa - fix for losing events during device power transitions
  Input: sh_keysc - remove dependency on SUPERH
  ...
2016-03-18  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching  Linus Torvalds
Pull livepatching update from Jiri Kosina:

 - cleanup of module notifiers; this depends on a module.c cleanup
   which has been acked by Rusty; from Jessica Yu

 - small assorted fixes and MAINTAINERS update

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch/module: remove livepatch module notifier
  modules: split part of complete_formation() into prepare_coming_module()
  livepatch: Update maintainers
  livepatch: Fix the error message about unresolvable ambiguity
  klp: remove CONFIG_LIVEPATCH dependency from klp headers
  klp: remove superfluous errors in asm/livepatch.h
2016-03-18  Merge tag 'gpio-v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio  Linus Torvalds
Pull GPIO updates from Linus Walleij:
 "This is the bulk of GPIO changes for kernel v4.6. There is quite a
  lot of interesting stuff going on.

  The patches to other subsystems and arch-wide are ACKed as far as
  possible, though I consider things like per-arch <asm/gpio.h> as
  essentially a part of the GPIO subsystem so it should not be needed.

  Core changes:

   - The gpio_chip is now a *real device*. Until now the gpio chips
     were just piggybacking the parent device or (gasp) floating in
     space outside of the device model.

     We now finally make GPIO chips devices. The gpio_chip will create
     a gpio_device which contains a struct device, and this gpio_device
     struct is kept private. Anything that needs to be kept private
     from the rest of the kernel will gradually be moved over to the
     gpio_device.

   - As a result of making the gpio_device a real device, we have added
     resource management, so devm_gpiochip_add_data() will cut down on
     overhead and reduce code lines. A huge slew of patches convert
     almost all drivers in the subsystem to use this.

   - Building on making the GPIO a real device, we add the first step
     of a new userspace ABI: the GPIO character device. We take small
     steps here, so we first add a pure *information* ABI and the tool
     "lsgpio" that will list all GPIO devices on the system and all
     lines on these devices.

     We can now discover GPIOs properly from userspace. We still have
     not come up with a way to actually *use* GPIOs from userspace.

   - To encourage people to use the character device for the future, we
     have it always-enabled when using GPIO. The old sysfs ABI is still
     opt-in (and can be used in parallel), but is marked as deprecated.

     We will keep it around for the foreseeable future, but it will not
     be extended to cover ever more use cases.

  Cleanup:

   - Bjorn Helgaas removed a whole slew of per-architecture
     <asm/gpio.h> includes.

     This dates back to when GPIO was an opt-in feature and no shared
     library even existed: just a header file with proper prototypes
     was provided and all semantics were up to the arch to implement.
     These patches make the GPIO chip even more a proper device and
     clean out leftovers of the old in-kernel API here and there.

     Still some cruft is left but it's very little now.

   - There is still some clamping of return values for .get() going on,
     but we now return sane values in the vast majority of drivers and
     the errorpath is sanitized. Some patches for powerpc, blackfin and
     unicore still drop in.

   - We continue to switch the ARM, MIPS, blackfin, m68k local GPIO
     implementations to use gpiochip_add_data() and cut down on code
     lines.

   - MPC8xxx is converted to use the generic GPIO helpers.

   - ATH79 is converted to use the generic GPIO helpers.

  New drivers:

   - WinSystems WS16C48

   - Acces 104-DIO-48E

   - F81866 (a F7188x variant)

   - Qoric (a MPC8xxx variant)

   - TS-4800

   - SPI serializers (pisosr): simple 74xx shift registers connected to
     SPI to obtain a dirt-cheap output-only GPIO expander.

   - Texas Instruments TPIC2810

   - Texas Instruments TPS65218

   - Texas Instruments TPS65912

   - X-Gene (ARM64) standby GPIO controller"

* tag 'gpio-v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (194 commits)
  Revert "Share upstreaming patches"
  gpio: mcp23s08: Fix clearing of interrupt.
  gpiolib: Fix comment referring to gpio_*() in gpiod_*()
  gpio: pca953x: Fix pca953x_gpio_set_multiple() on 64-bit
  gpio: xgene: Fix kconfig for standby GIPO contoller
  gpio: Add generic serializer DT binding
  gpio: uapi: use 0xB4 as ioctl() major
  gpio: tps65912: fix bad merge
  Revert "gpio: lp3943: Drop pin_used and lp3943_gpio_request/lp3943_gpio_free"
  gpio: omap: drop dev field from gpio_bank structure
  gpio: mpc8xxx: Slightly update the code for better readability
  gpio: mpc8xxx: Remove *read_reg and *write_reg from struct mpc8xxx_gpio_chip
  gpio: mpc8xxx: Fixup setting gpio direction output
  gpio: mcp23s08: Add support for mcp23s18
  dt-bindings: gpio: altera: Fix altr,interrupt-type property
  gpio: add driver for MEN 16Z127 GPIO controller
  gpio: lp3943: Drop pin_used and lp3943_gpio_request/lp3943_gpio_free
  gpio: timberdale: Switch to devm_ioremap_resource()
  gpio: ts4800: Add IMX51 dependency
  gpiolib: rewrite gpiodev_add_to_list
  ...
2016-03-18  Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  Linus Torvalds
Pull arm64 updates from Catalin Marinas:
 "Here are the main arm64 updates for 4.6. There are some relatively
  intrusive changes to support KASLR, the reworking of the kernel
  virtual memory layout and initial page table creation.

  Summary:

   - Initial page table creation reworked to avoid breaking large block
     mappings (huge pages) into smaller ones. The ARM architecture
     requires break-before-make in such cases to avoid TLB conflicts
     but that's not always possible on live page tables

   - Kernel virtual memory layout: the kernel image is no longer linked
     to the bottom of the linear mapping (PAGE_OFFSET) but at the
     bottom of the vmalloc space, allowing the kernel to be loaded
     (nearly) anywhere in physical RAM

   - Kernel ASLR: position independent kernel Image and modules being
     randomly mapped in the vmalloc space with the randomness provided
     by UEFI (efi_get_random_bytes() patches merged via the arm64 tree,
     acked by Matt Fleming)

   - Implement relative exception tables for arm64, required by KASLR
     (initial code for ARCH_HAS_RELATIVE_EXTABLE added to lib/extable.c
     but actual x86 conversion deferred to 4.7 because of the merge
     dependencies)

   - Support for the User Access Override feature of ARMv8.2: this
     allows uaccess functions (get_user etc.) to be implemented using
     LDTR/STTR instructions. Such instructions, when run by the kernel,
     perform unprivileged accesses adding an extra level of protection.
     The set_fs() macro is used to "upgrade" such instruction to
     privileged accesses via the UAO bit

   - Half-precision floating point support (part of ARMv8.2)

   - Optimisations for CPUs with or without a hardware prefetcher
     (using run-time code patching)

   - copy_page performance improvement to deal with 128 bytes at a time

   - Sanity checks on the CPU capabilities (via CPUID) to prevent
     incompatible secondary CPUs from being brought up (e.g. weird
     big.LITTLE configurations)

   - valid_user_regs() reworked for better sanity check of the
     sigcontext information (restored pstate information)

   - ACPI parking protocol implementation

   - CONFIG_DEBUG_RODATA enabled by default

   - VDSO code marked as read-only

   - DEBUG_PAGEALLOC support

   - ARCH_HAS_UBSAN_SANITIZE_ALL enabled

   - Erratum workaround Cavium ThunderX SoC

   - set_pte_at() fix for PROT_NONE mappings

   - Code clean-ups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (99 commits)
  arm64: kasan: Fix zero shadow mapping overriding kernel image shadow
  arm64: kasan: Use actual memory node when populating the kernel image shadow
  arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission
  arm64: Fix misspellings in comments.
  arm64: efi: add missing frame pointer assignment
  arm64: make mrs_s prefixing implicit in read_cpuid
  arm64: enable CONFIG_DEBUG_RODATA by default
  arm64: Rework valid_user_regs
  arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly
  arm64: KVM: Move kvm_call_hyp back to its original localtion
  arm64: mm: treat memstart_addr as a signed quantity
  arm64: mm: list kernel sections in order
  arm64: lse: deal with clobbered IP registers after branch via PLT
  arm64: mm: dump: Use VA_START directly instead of private LOWEST_ADDR
  arm64: kconfig: add submenu for 8.2 architectural features
  arm64: kernel: acpi: fix ioremap in ACPI parking protocol cpu_postboot
  arm64: Add support for Half precision floating point
  arm64: Remove fixmap include fragility
  arm64: Add workaround for Cavium erratum 27456
  arm64: mm: Mark .rodata as RO
  ...
2016-03-17Merge tag 'please-pull-pstore' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore update from Tony Luck: "Allow ram backend to be configured with addresses above 4GB" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Add support for 64 Bit address space
2016-03-17Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Performance improvements in SEEK_DATA and xattr scalability improvements, plus a lot of clean ups and bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits) ext4: clean up error handling in the MMP support jbd2: do not fail journal because of frozen_buffer allocation failure ext4: use __GFP_NOFAIL in ext4_free_blocks() ext4: fix compile error while opening the macro DOUBLE_CHECK ext4: print ext4 mount option data_err=abort correctly ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry() ext4: fix misspellings in comments. jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path ext4: more efficient SEEK_DATA implementation ext4: cleanup handling of bh->b_state in DAX mmap ext4: return hole from ext4_map_blocks() ext4: factor out determining of hole size ext4: fix setting of referenced bit in ext4_es_lookup_extent() ext4: remove i_ioend_count ext4: simplify io_end handling for AIO DIO ext4: move trans handling and completion deferal out of _ext4_get_block ext4: rename and split get blocks functions ext4: use i_mutex to serialize unaligned AIO DIO ext4: pack ioend structure better ...
2016-03-17Merge tag 'configfs-for-linus' of git://git.infradead.org/users/hch/configfsLinus Torvalds
Pull configfs updates from Christoph Hellwig: - A large patch from me to simplify setting up the list of default groups by actually implementing it as a list instead of an array. - a small Y2038 prep patch from Deepa Dinamani. Probably doesn't matter on its own, but it seems like he is trying to get rid of all CURRENT_TIME uses in file systems, which is a worthwhile goal. * tag 'configfs-for-linus' of git://git.infradead.org/users/hch/configfs: configfs: switch ->default groups to a linked list configfs: Replace CURRENT_TIME by current_fs_time()
2016-03-17lib/bug.c: use common WARN helperJosh Poimboeuf
The traceoff_on_warning option doesn't have any effect on s390, powerpc, arm64, parisc, and sh because there are two different types of WARN implementations: 1) The above mentioned architectures treat WARN() as a special case of a BUG() exception. They handle warnings in report_bug() in lib/bug.c. 2) All other architectures just call warn_slowpath_*() directly. Their warnings are handled in warn_slowpath_common() in kernel/panic.c. Support traceoff_on_warning on all architectures and prevent any future divergence by using a single common function to emit the warning. Also remove the '()' from '%pS()', because the parentheses look funky: [ 45.607629] WARNING: at /root/warn_mod/warn_mod.c:17 .init_dummy+0x20/0x40 [warn_mod]() Reported-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17param: convert some "on"/"off" users to strtoboolKees Cook
This changes several users of manual "on"/"off" parsing to use strtobool. Some side-effects: - these uses will now parse y/n/1/0 meaningfully too - the early_param uses will now bubble up parse errors Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Amitkumar Karwar <akarwar@marvell.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Joe Perches <joe@perches.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nishant Sarmukadam <nishants@marvell.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Steve French <sfrench@samba.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17lib: move strtobool() to kstrtobool()Kees Cook
Create the kstrtobool_from_user() helper and move strtobool() logic into the new kstrtobool() (matching all the other kstrto* functions). Provides an inline wrapper for existing strtobool() callers. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Amitkumar Karwar <akarwar@marvell.com> Cc: Nishant Sarmukadam <nishants@marvell.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Steve French <sfrench@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
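The parsing behaviour described in these two commits can be illustrated with a small userspace sketch. This is not the kernel's kstrtobool(); the name sketch_strtobool() is hypothetical, and the accepted spellings (y/n/1/0 plus "on"/"off", with -EINVAL for anything else) are an assumption based on the commit messages above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for a strtobool-style parser.
 * Assumption: only the leading character(s) decide the result.
 */
static int sketch_strtobool(const char *s, bool *res)
{
	if (s == NULL || res == NULL)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" / "off": the second character disambiguates. */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		default:
			break;
		}
		break;
	default:
		break;
	}
	/* Unrecognised input: report the error instead of guessing. */
	return -EINVAL;
}
```

Returning an error code rather than silently defaulting is what lets an early_param handler bubble parse errors up, as the first commit notes.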
2016-03-17include/linux/unaligned: force inlining of byteswap operationsDenys Vlasenko
Sometimes gcc mysteriously doesn't inline very small functions we expect to be inlined. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

With this .config: http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os, the following functions get deinlined many times. Examples of disassembly:

<get_unaligned_be16> (24 copies, 108 calls):
    66 8b 07        mov    (%rdi),%ax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    86 e0           xchg   %ah,%al
    5d              pop    %rbp
    c3              retq

<get_unaligned_be32> (25 copies, 181 calls):
    8b 07           mov    (%rdi),%eax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    0f c8           bswap  %eax
    5d              pop    %rbp
    c3              retq

<get_unaligned_be64> (23 copies, 94 calls):
    48 8b 07        mov    (%rdi),%rax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    48 0f c8        bswap  %rax
    5d              pop    %rbp
    c3              retq

<put_unaligned_be16> (2 copies, 11 calls):
    89 f8           mov    %edi,%eax
    55              push   %rbp
    c1 ef 08        shr    $0x8,%edi
    c1 e0 08        shl    $0x8,%eax
    09 c7           or     %eax,%edi
    48 89 e5        mov    %rsp,%rbp
    66 89 3e        mov    %di,(%rsi)

<put_unaligned_be32> (8 copies, 43 calls):
    55              push   %rbp
    0f cf           bswap  %edi
    89 3e           mov    %edi,(%rsi)
    48 89 e5        mov    %rsp,%rbp
    5d              pop    %rbp
    c3              retq

<put_unaligned_be64> (26 copies, 157 calls):
    55              push   %rbp
    48 0f cf        bswap  %rdi
    48 89 3e        mov    %rdi,(%rsi)
    48 89 e5        mov    %rsp,%rbp
    5d              pop    %rbp
    c3              retq

This patch fixes this via s/inline/__always_inline/. It only affects arches with efficient unaligned access insns, such as x86. (Arches which lack such ops do not include linux/unaligned/access_ok.h.)

Code size decrease after the patch is ~8.5k:

    text     data      bss       dec     hex filename
92197848 20826112 36417536 149441496 8e84bd8 vmlinux
92189231 20826144 36417536 149432911 8e82a4f vmlinux6_unaligned_be_after

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
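The s/inline/__always_inline/ change can be sketched outside the kernel as follows. This is a minimal userspace illustration, assuming GCC or Clang (for the always_inline attribute, __builtin_bswap32() and the __BYTE_ORDER__ macros); SKETCH_ALWAYS_INLINE and sketch_get_unaligned_be32() are hypothetical names, not the kernel's definitions.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang attribute syntax; the kernel spells this __always_inline. */
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))

/* Read a big-endian 32-bit value from a possibly unaligned address. */
static SKETCH_ALWAYS_INLINE uint32_t sketch_get_unaligned_be32(const void *p)
{
	uint32_t v;

	/* memcpy() is the portable way to express an unaligned load. */
	memcpy(&v, p, sizeof(v));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	/* On little-endian hosts the bytes must be swapped. */
	v = __builtin_bswap32(v);
#endif
	return v;
}
```

With a plain inline hint the compiler remains free to emit an out-of-line copy per translation unit, which is the duplication counted in the disassembly above; the attribute removes that freedom for this function.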
2016-03-17include/uapi/linux/byteorder, swab: force inlining of some byteswap operationsDenys Vlasenko
Sometimes gcc mysteriously doesn't inline very small functions we expect to be inlined. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

With this .config: http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os, the following functions get deinlined many times. Examples of disassembly:

<get_unaligned_be16> (12 copies, 51 calls):
    66 8b 07        mov    (%rdi),%ax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    86 e0           xchg   %ah,%al
    5d              pop    %rbp
    c3              retq

<get_unaligned_be32> (12 copies, 135 calls):
    8b 07           mov    (%rdi),%eax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    0f c8           bswap  %eax
    5d              pop    %rbp
    c3              retq

<get_unaligned_be64> (2 copies, 20 calls):
    48 8b 07        mov    (%rdi),%rax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    48 0f c8        bswap  %rax
    5d              pop    %rbp
    c3              retq

<__swab16p> (16 copies, 146 calls):
    55              push   %rbp
    89 f8           mov    %edi,%eax
    86 e0           xchg   %ah,%al
    48 89 e5        mov    %rsp,%rbp
    5d              pop    %rbp
    c3              retq

<__swab32p> (43 copies, ~560 calls):
    55              push   %rbp
    89 f8           mov    %edi,%eax
    0f c8           bswap  %eax
    48 89 e5        mov    %rsp,%rbp
    5d              pop    %rbp
    c3              retq

<__swab64p> (21 copies, 119 calls):
    55              push   %rbp
    48 89 f8        mov    %rdi,%rax
    48 0f c8        bswap  %rax
    48 89 e5        mov    %rsp,%rbp
    5d              pop    %rbp
    c3              retq

<__swab32s> (6 copies, 47 calls):
    8b 07           mov    (%rdi),%eax
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    0f c8           bswap  %eax
    89 07           mov    %eax,(%rdi)
    5d              pop    %rbp
    c3              retq

This patch fixes this via s/inline/__always_inline/.

Code size decrease after the patch is ~4.5k:

    text     data      bss       dec     hex filename
92202377 20826112 36417536 149446025 8e85d89 vmlinux
92197848 20826112 36417536 149441496 8e84bd8 vmlinux5_swap_after

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17include/asm-generic/atomic-long.h: force inlining of some atomic_long operationsDenys Vlasenko
Sometimes gcc mysteriously doesn't inline very small functions we expect to be inlined. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

With this .config: http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os, atomic_long_inc(), atomic_long_dec() and atomic_long_add() functions get deinlined about 40 times. Examples of disassembly:

<atomic_long_inc> (21 copies, 147 calls):
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    f0 48 ff 07     lock incq (%rdi)
    5d              pop    %rbp
    c3              retq

<atomic_long_dec> (4 copies, 14 calls) is similar to inc.

<atomic_long_add> (11 copies, 41 calls):
    55              push   %rbp
    48 89 e5        mov    %rsp,%rbp
    f0 48 01 3e     lock add %rdi,(%rsi)
    5d              pop    %rbp
    c3              retq

This patch fixes this via s/inline/__always_inline/.

Code size decrease after the patch is ~1.3k:

    text     data      bss       dec     hex filename
92203657 20826112 36417536 149447305 8e86289 vmlinux
92202377 20826112 36417536 149446025 8e85d89 vmlinux4_atomiclong_after

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17lib/string: introduce match_string() helperAndy Shevchenko
Occasionally we have to search for an occurrence of a string in an array of strings. Make a simple helper for that purpose. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: David Airlie <airlied@linux.ie> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
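As an illustration of the kind of helper described here, the following userspace sketch searches an array of strings for an exact match. The name sketch_match_string() is hypothetical, and its conventions (returning the index on success, -EINVAL on failure, and stopping early at a NULL entry) are assumptions for the example rather than a statement of the kernel helper's contract.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: return the index of 'string' in 'array',
 * or -EINVAL if it is not present. A NULL entry ends the search early.
 */
static int sketch_match_string(const char * const *array, size_t n,
			       const char *string)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (array[i] == NULL)
			break;
		if (strcmp(array[i], string) == 0)
			return (int)i;
	}
	return -EINVAL;
}
```

A caller would typically map a user-supplied mode name onto an enum by using the returned index directly, after checking for a negative result.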
2016-03-17radix-tree,shmem: introduce radix_tree_iter_next()Matthew Wilcox
shmem likes to occasionally drop the lock, schedule, then reacquire the lock and continue with the iteration from the last place it left off. This is currently done with a pretty ugly goto. Introduce radix_tree_iter_next() and use it throughout shmem.c. [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17radix_tree: add support for multi-order entriesMatthew Wilcox
With huge pages, it is convenient to have the radix tree be able to return an entry that covers multiple indices. Previous attempts to deal with the problem have involved inserting N duplicate entries, which is a waste of memory and leads to problems trying to handle aliased tags, or probing the tree multiple times to find alternative entries which might cover the requested index. This approach inserts one canonical entry into the tree for a given range of indices, and may also insert other entries in order to ensure that lookups find the canonical entry. This solution only tolerates inserting powers of two that are greater than the fanout of the tree. If we wish to expand the radix tree's abilities to support large-ish pages that are smaller than the fanout at the penultimate level of the tree, then we would need to add one more step in lookup to ensure that any sibling nodes in the final level of the tree are dereferenced and we return the canonical entry that they reference. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17radix-tree: add an explicit include of bitops.hMatthew Wilcox
The radix-tree header uses the __ffs() function, which is defined in bitops.h. The current kernel headers implicitly include bitops.h, but the userspace test harness does not. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17include/linux/list_bl.h: use bool instead of int for boolean functionsChen Gang
hlist_bl_unhashed() and hlist_bl_empty() are all boolean functions, so return bool instead of int. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17include/uapi/linux/elf-em.h: remove v850Rob Landley
The v850 port was removed by commits f606ddf42fd4 and 07a887d399b8 in 2008. These #defines are not used in the current kernel. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17fix Christoph's email addressesChristoph Lameter
There are various email addresses for me throughout the kernel. Use the one that will always be valid. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17bug: set warn variable before calling WARN()Steven Rostedt
This has hit me a couple of times already. I would be debugging code and the system would simply hang and then reboot. Finally, I found that the problem was caused by WARN_ON_ONCE() and friends.

The macro WARN_ON_ONCE(condition) is defined as:

    static bool __section(.data.unlikely) __warned;
    int __ret_warn_once = !!(condition);

    if (unlikely(__ret_warn_once))
        if (WARN_ON(!__warned))
            __warned = true;
    unlikely(__ret_warn_once);

Which looks great and all. But what I have hit is an issue when WARN_ON() itself hits the same WARN_ON_ONCE() code. Because the variable __warned is not yet set, it too calls WARN_ON() and that triggers the warning again. It keeps doing this until the stack overflows and the system crashes.

Setting __warned before calling WARN_ON() makes the original WARN_ON_ONCE() really only warn once, and not an infinite number of times if the WARN_ON() also triggers the warning.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
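The ordering fix can be modelled in userspace with a reporting path that re-enters the same warn-once site. This is only a sketch under that assumption: sketch_check() and sketch_report() are hypothetical stand-ins for the WARN_ON_ONCE() site and for WARN_ON(), not the kernel macros.

```c
#include <stdbool.h>

static int sketch_reports;	/* how many times the "warning" fired */

static void sketch_check(bool condition);

/*
 * Stand-in for WARN_ON(): the reporting path itself hits the same
 * warn-once site again, as in the hang described above.
 */
static void sketch_report(void)
{
	sketch_reports++;
	sketch_check(true);	/* nested hit of the same site */
}

/* Stand-in for a WARN_ON_ONCE() site with the fixed ordering. */
static void sketch_check(bool condition)
{
	static bool warned;

	if (condition && !warned) {
		warned = true;		/* set the flag first ... */
		sketch_report();	/* ... then emit the warning */
	}
}
```

If the two statements in the if body were swapped, the nested call would still see the flag clear and recurse without bound, which matches the stack overflow in the report.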
2016-03-17timer: convert timer_slack_ns from unsigned long to u64John Stultz
This patchset introduces a /proc/<pid>/timerslack_ns interface which would allow controlling processes to set the timerslack value on other processes in order to save power by avoiding wakeups (something Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage, which was defined as an unsigned long, limiting the slack range to ~4 seconds on 32bit systems. It converts it to a u64, which provides the same basically unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc/<pid>/timerslack_ns interface which allows the full 64bit slack range for a task to be read or set on both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on bash to 10 seconds:

    $ time sleep 1

    real    0m10.747s
    user    0m0.001s
    sys     0m0.005s

The first patch is a little ugly, since I had to chase the slack delta arguments through a number of functions converting them to u64s. Let me know if it makes sense to break that up more or not. Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently an unsigned long. This means that on 32bit systems, the maximum slack is just over 4 seconds. However, on 64bit machines, it's much, much larger (~500 years). This disparity could make application development a little harder, so this patch converts timer_slack_ns (as well as the default_slack) to a u64. This means both 32bit and 64bit systems have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies the interface as an unsigned long, so we preserve that limitation on 32bit systems, where SET_TIMERSLACK can only set the slack to an unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger than what can be stored by an unsigned long. This patch also modifies hrtimer functions which specified the slack delta as an unsigned long.
Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Kees Cook <keescook@chromium.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
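The ranges quoted in this commit follow from simple arithmetic on a nanosecond counter, sketched below. The helper names are hypothetical, and the clamping function only models the described GET_TIMERSLACK behaviour on a 32bit system (returning the maximum value when the stored slack does not fit); it is not the kernel's implementation.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/*
 * Largest slack, in whole seconds, representable by a nanosecond
 * counter of the given width in bits (for example 32 or 64).
 */
static uint64_t sketch_max_slack_seconds(unsigned int bits)
{
	uint64_t max_ns = (bits >= 64) ? UINT64_MAX : ((1ULL << bits) - 1);

	return max_ns / SKETCH_NSEC_PER_SEC;
}

/*
 * Model of reading a u64 slack through a 32-bit unsigned long:
 * values that do not fit are reported as the 32-bit maximum.
 */
static uint32_t sketch_get_slack_32(uint64_t slack_ns)
{
	return slack_ns > UINT32_MAX ? UINT32_MAX : (uint32_t)slack_ns;
}
```

A 32-bit counter tops out at 4 whole seconds (just over 4.29 s), while a 64-bit counter allows about 18.4 billion seconds, which is several centuries and the same order of magnitude as the ~500 years quoted above.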
2016-03-17thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkersKirill A. Shutemov
freeze_page() and unfreeze_page() helpers evolved into rather complex beasts. It would be nice to cut the complexity of this code. This patch rewrites freeze_page() using standard try_to_unmap(). unfreeze_page() is rewritten with remove_migration_ptes(). The result is much simpler. But the new variant is somewhat slower for PTE-mapped THPs. Current helpers iterate over the VMAs the compound page is mapped to, and then over the ptes within each VMA. New helpers iterate over each small page, then over the VMAs the small page is mapped to, and only then find the relevant pte. We have a shortcut for PMD-mapped THP: we directly install migration entries on PMD split. I don't think the slowdown is critical, considering how much simpler the result is and that split_huge_page() is quite rare nowadays. It only happens due to memory pressure or migration. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: make remove_migration_ptes() beyond mm/migration.cKirill A. Shutemov
Make remove_migration_ptes() available to be used in split_huge_page(). New parameter 'locked' added: as with try_to_unmap() we need a way to indicate that the caller holds the rmap lock. We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP pages are never mlocked. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17rmap: extend try_to_unmap() to be usable by split_huge_page()Kirill A. Shutemov
Add support for two ttu_flags: - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to unmap page; - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock; Also, change rwc->done to !page_mapcount() instead of !page_mapped(). try_to_unmap() works on pte level, so we are really interested in the mappedness of this small page rather than of the compound page it's a part of. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17rmap: introduce rmap_walk_locked()Kirill A. Shutemov
This patchset rewrites freeze_page() and unfreeze_page() using try_to_unmap() and remove_migration_ptes(). The result is much simpler, but somewhat slower.

Migration 8GiB worth of PMD-mapped THP:

    Baseline    20.21 +/- 0.393
    Patched     20.73 +/- 0.082
    Slowdown    1.03x

It's 3% slower, compared to 14% in v1. I don't think it should be a stopper.

Splitting of PTE-mapped pages slowed more. But this is not a common case.

Migration 8GiB worth of PMD-mapped THP:

    Baseline    20.39 +/- 0.225
    Patched     22.43 +/- 0.496
    Slowdown    1.10x

rmap_walk_locked() is the same as rmap_walk(), but the caller takes care of the relevant rmap lock. This is preparation for switching THP splitting from the custom rmap walk in freeze_page()/unfreeze_page() to the generic one. There is no support for KSM pages for now: it is not clear which lock is implied.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: remove VM_FAULT_MINORJan Kara
The define has a comment from Nick Piggin from 2007: /* For backwards compat. Remove me quickly. */ I guess 9 years should not be too hurried sense of 'quickly' even for kernel measures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: exclude ZONE_DEVICE from GFP_ZONE_TABLEDan Williams
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new mm zones that are bumping up against the current maximum limit of 4 zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE. The GFP_ZONE_TABLE poses an interesting constraint since include/linux/gfp.h gets included by the 32-bit portion of a 64-bit build. We need to be careful to only build the table for zones that have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this purpose. This patch does not attempt to solve the problem of adding a new zone that also has a corresponding GFP_ flag. Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to consume another bit in page->flags (expand ZONES_WIDTH) with room to spare. Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931 Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Mark <markk@clara.co.uk> Reported-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm/page_ref: add tracepoint to track down page reference manipulationJoonsoo Kim
CMA allocation should be guaranteed to succeed by definition, but, unfortunately, it sometimes fails. It is hard to track down the problem, because it is related to page reference manipulation and we don't have any facility to analyze it. This patch adds tracepoints to track down page reference manipulation. With it, we can find the exact reason for a failure and fix the problem.

Following is an example of tracepoint output. (Note: this example is from a stale version that prints flags as a number. The recent version prints them as a human-readable string.)

<...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
<...>-9018 [004] 92.678378: kernel_stack:
 => get_page_from_freelist (ffffffff81176659)
 => __alloc_pages_nodemask (ffffffff81176d22)
 => alloc_pages_vma (ffffffff811bf675)
 => handle_mm_fault (ffffffff8119e693)
 => __do_page_fault (ffffffff810631ea)
 => trace_do_page_fault (ffffffff81063543)
 => do_async_page_fault (ffffffff8105c40a)
 => async_page_fault (ffffffff817581d8)
[snip]
<...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
[snip]
...
...
<...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
[snip]
<...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
 => release_pages (ffffffff8117c9e4)
 => free_pages_and_swap_cache (ffffffff811b0697)
 => tlb_flush_mmu_free (ffffffff81199616)
 => tlb_finish_mmu (ffffffff8119a62c)
 => exit_mmap (ffffffff811a53f7)
 => mmput (ffffffff81073f47)
 => do_exit (ffffffff810794e9)
 => do_group_exit (ffffffff81079def)
 => SyS_exit_group (ffffffff81079e74)
 => entry_SYSCALL_64_fastpath (ffffffff817560b6)

This output shows that the problem comes from the exit path. In the exit path, to improve performance, pages are not freed immediately. They are gathered and processed in batches.
During this process, migration is not possible and CMA allocation fails. This problem is hard to find without this page reference tracepoint facility.

Enabling this feature bloats the kernel text by about 30 KB in my configuration.

    text    data     bss      dec    hex filename
12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled

Note that, due to a header file dependency problem between mm.h and tracepoint.h, this feature has to open code the static key functions for tracepoints. Proposed by Steven Rostedt in the following link.

https://lkml.org/lkml/2015/12/9/699

[arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()] [iamjoonsoo.kim@lge.com: fix build failure for xtensa] [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: introduce page reference manipulation functionsJoonsoo Kim
The success of CMA allocation largely depends on the success of migration, and a key factor in that is the page reference count. Until now, the page reference count has been manipulated by calling atomic functions directly, so we cannot track who manipulates it and where. That makes it hard to find the actual reason for a CMA allocation failure. CMA allocation should be guaranteed to succeed, so finding the offending place is really important. In this patch, call sites where the page reference count is manipulated are converted to the newly introduced wrapper functions. This is a preparation step for adding a tracepoint to each page reference manipulation function. With this facility, we can easily find the reason for a CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. That will help a second step that renames page._count to something else and prevents later attempts to access it directly (suggested by Andrew). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
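The wrapper idea can be sketched in userspace with C11 atomics. Everything here is hypothetical: struct sketch_page, the sketch_page_ref_*() names and the counter standing in for a tracepoint are illustrative only and do not reproduce the kernel's page_ref API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature of a page with a reference count. */
struct sketch_page {
	atomic_int refcount;
};

static int sketch_ref_events;	/* stand-in for a tracepoint */

/* Single hook that every modification passes through. */
static void sketch_trace_ref(struct sketch_page *page, int delta)
{
	(void)page;
	(void)delta;
	sketch_ref_events++;
}

/* Read sites also go through a wrapper, so the field can be renamed later. */
static inline int sketch_page_ref_count(struct sketch_page *page)
{
	return atomic_load(&page->refcount);
}

static inline void sketch_page_ref_inc(struct sketch_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
	sketch_trace_ref(page, 1);
}

/* Returns true when the count dropped to zero. */
static inline bool sketch_page_ref_dec_and_test(struct sketch_page *page)
{
	bool zero = atomic_fetch_sub(&page->refcount, 1) == 1;

	sketch_trace_ref(page, -1);
	return zero;
}
```

Because callers no longer touch the atomic field directly, instrumentation can later be added in one place without revisiting every call site, which is the motivation stated above.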
2016-03-17mm: thp: set THP defrag by default to madvise and add a stall-free defrag optionMel Gorman
THP defrag is enabled by default to direct reclaim/compact but not wake kswapd in the event of a THP allocation failure. The problem is that THP allocation requests potentially enter reclaim/compaction. This potentially incurs a severe stall that is not guaranteed to be offset by reduced TLB misses. While there has been considerable effort to reduce the impact of reclaim/compaction, it is still a high cost and workloads that should fit in memory fail to do so. Specifically, a simple anon/file streaming workload will enter direct reclaim on NUMA at least even though the working set size is 80% of RAM. It's been years and it's time to throw in the towel.

First, this patch defines THP defrag as follows:

 madvise: A failed allocation will direct reclaim/compact if the application requests it
 never:   Neither reclaim/compact nor wake kswapd
 defer:   A failed allocation will wake kswapd/kcompactd
 always:  A failed allocation will direct reclaim/compact (historical behaviour)

khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter direct reclaim/compaction for applications that specifically requested it.

Lastly, it removes a check from the page allocator slowpath that is related to __GFP_THISNODE to allow "defer" to work. The callers that really care are slub/slab and they are updated accordingly. The slab one may be surprising because it also corrects a comment as kswapd was never woken up by that path.

This means that a THP fault will no longer stall for most applications by default, which is ideal for most users, who get THP if they are immediately available. There are still options for users that prefer a stall at startup of a new application by either restoring historical behaviour with "always" or picking a half-way point with "defer" where kswapd does some of the work in the background and wakes kcompactd if necessary.
THP defrag for khugepaged remains enabled and will enter direct reclaim but not wake kswapd or kcompactd.

After this patch a THP allocation failure will quickly fall back and rely on khugepaged to recover the situation at some time in the future. In some cases, this will reduce THP usage, but the benefit of THP is hard to measure and not a universal win, whereas a stall to reclaim/compaction is definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write a large anonymous mapping (to avoid the zero page) multiple times. The total size of the mappings is 80% of RAM and the benchmark simply measures how long it takes to complete. It uses multiple threads to see if that is a factor. On UMA, the performance is almost identical so it is not reported, but on NUMA we see this:

usemem
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)

For a single thread, the benchmark completes 43.23% faster with this patch applied, with smaller benefits as the thread count increases. Similarly, notice the large reduction in system CPU usage in most cases. The overall CPU time is

                   4.4.0          4.4.0
          kcompactd-v1r1  nodefrag-v1r3
User            10357.65       10438.33
System           3988.88        3543.94
Elapsed          2203.01        1634.41

Which is substantial.
Now, the reclaim figures

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                   128458477      278352931
Major Faults                     2174976            225
Swap Ins                        16904701              0
Swap Outs                       17359627              0
Allocation stalls                  43611              0
DMA allocs                             0              0
DMA32 allocs                    19832646       19448017
Normal allocs                  614488453      580941839
Movable allocs                         0              0
Direct pages scanned            24163800              0
Kswapd pages scanned                   0              0
Kswapd pages reclaimed                 0              0
Direct pages reclaimed          20691346              0
Compaction stalls                  42263              0
Compaction success                   938              0
Compaction failures                41325              0

This patch eliminates almost all swapping and direct reclaim activity. There is still overhead but it's from NUMA balancing, which does not identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark, which forces a corner case where compaction can be used heavily and measures the fault latency according to whether base or huge pages were used.

thpscale Fault Latencies
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority of cases, but with the obvious caveat that fewer THPs are actually used in this adverse workload

                                  4.4.0                 4.4.0
                         kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1       0.71 (  0.00%)       14.04 (1865.22%)
Percentage huge-3      10.77 (  0.00%)       33.05 (206.85%)
Percentage huge-5      60.39 (  0.00%)       38.51 (-36.23%)
Percentage huge-7      45.97 (  0.00%)       34.57 (-24.79%)
Percentage huge-12     68.12 (  0.00%)       40.07 (-41.17%)
Percentage huge-18     64.93 (  0.00%)       47.82 (-26.35%)
Percentage huge-24     62.69 (  0.00%)       44.23 (-29.44%)
Percentage huge-30     43.49 (  0.00%)       55.38 ( 27.34%)
Percentage huge-32     50.72 (  0.00%)       51.90 (  2.35%)

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                    37429143       47564000
Major Faults                        1916           1558
Swap Ins                            1466           1079
Swap Outs                        2936863         149626
Allocation stalls                  62510              3
DMA allocs                             0              0
DMA32 allocs                     6566458        6401314
Normal allocs                  216361697      216538171
Movable allocs                         0              0
Direct pages scanned            25977580          17998
Kswapd pages scanned                   0        3638931
Kswapd pages reclaimed                 0         207236
Direct pages reclaimed           8833714             88
Compaction stalls                 103349              5
Compaction success                   270              4
Compaction failures               103079              1

Note again that while this does swap as it's an aggressive workload, the direct reclaim activity and allocation stalls are substantially reduced. There is some kswapd activity, but ftrace showed that the kswapd activity was due to normal wakeups from 4K pages being allocated. Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark.
For this, I do not have figures for NUMA but it's something that does impact UMA so I'll report what is available

stutter
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Min         mmap        7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap        7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap        7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap       11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap       27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap       28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap       28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap       29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap     4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap       12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a heavy IO load -- a scenario that desktop users used to complain about frequently. This shows a mix because the ideal case of mapping with THP is not hit as often. However, note that 99% of the mappings complete 13.79% faster. The CPU usage here is particularly interesting

                   4.4.0          4.4.0
          kcompactd-v1r1  nodefrag-v1r3
User               67.50           0.99
System           1327.88          91.30
Elapsed          2079.00        2128.98

And once again we look at the reclaim figures

                                   4.4.0          4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
Minor Faults                   335241922     1314582827
Major Faults                         715            819
Swap Ins                               0              0
Swap Outs                              0              0
Allocation stalls                 532723              0
DMA allocs                             0              0
DMA32 allocs                  1822364341     1177950222
Normal allocs                 1815640808     1517844854
Movable allocs                         0              0
Direct pages scanned            21892772              0
Kswapd pages scanned            20015890       41879484
Kswapd pages reclaimed          19961986       41822072
Direct pages reclaimed          21892741              0
Compaction stalls                1065755              0
Compaction success                   514              0
Compaction failures              1065241              0

Allocation stalls and all direct reclaim activity are eliminated, as well as compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly available. We're not going to reach the point where they are completely free, so let's take the costs out of the fast paths finally and defer the cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
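With "madvise" as the default defrag setting, an application that is willing to pay for direct reclaim/compaction has to ask for it. A minimal userspace sketch of such a request follows, using the standard mmap(2) and madvise(2) calls with MADV_HUGEPAGE. The helper name map_with_thp_hint() is hypothetical and is not part of the patch; whether the hint is accepted depends on the kernel configuration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch): map `len` bytes of anonymous
 * memory and mark the range as a THP candidate with MADV_HUGEPAGE.
 *
 * Returns the mapping, or NULL if mmap() failed.  *hint_ok is set to 1
 * if the kernel accepted the hint and 0 otherwise (for example when THP
 * is not compiled in).
 */
static void *map_with_thp_hint(size_t len, int *hint_ok)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The mapping stays usable with base pages even if the hint fails. */
    *hint_ok = (madvise(p, len, MADV_HUGEPAGE) == 0);
    return p;
}
```

Under the "madvise" setting described above, only ranges marked this way may direct reclaim/compact on a failed THP allocation; all other faults fall back to base pages immediately.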
2016-03-17  mm: remove unnecessary description about a non-exist gfp flag  Satoru Takeuchi
Since __GFP_NOACCOUNT was removed by commit 20b5c3039863 ("Revert 'gfp: add __GFP_NOACCOUNT'"), its description is not necessary.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17  mm: scale kswapd watermarks in proportion to memory  Johannes Weiner
In machines with 140G of memory and enterprise flash storage, we have seen read and write bursts routinely exceed the kswapd watermarks and cause thundering herds in direct reclaim. Unfortunately, the only way to tune kswapd aggressiveness is through adjusting min_free_kbytes - the system's emergency reserves - which is entirely unrelated to the system's latency requirements. In order to get kswapd to maintain a 250M buffer of free memory, the emergency reserves need to be set to 1G. That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and overall allocation concurrency scale with memory capacity, so it makes sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the current formula in default settings. Ensure that the new formula never chooses steps smaller than that, i.e. 25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
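The 16M and 143M figures can be reproduced with a few lines of arithmetic. The sketch below is a simplification that treats the whole machine as a single zone and works in kilobytes; the function name wmark_step_kb() is hypothetical. Two values are assumptions rather than statements from the message: the scale factor is expressed in units of 1/10000 of memory (so 10 means 0.1%), and the 64M emergency reserve is inferred from the quoted 16M step being 25% of it.

```c
#include <stdint.h>

/*
 * Hypothetical single-zone sketch of the watermark step described above.
 *
 *   mem_kb     - total memory in kB
 *   reserve_kb - emergency reserve (min_free_kbytes) in kB
 *   scale      - assumed unit: 1/10000 of memory, so 10 == 0.1%
 *
 * The step is the larger of the old formula (25% of the reserve) and
 * the new one (a fraction of memory), so it never shrinks below the
 * old value.
 */
static uint64_t wmark_step_kb(uint64_t mem_kb, uint64_t reserve_kb,
                              uint64_t scale)
{
    uint64_t old_step = reserve_kb / 4;
    uint64_t new_step = mem_kb * scale / 10000;

    return new_step > old_step ? new_step : old_step;
}
```

For a 140G machine (146800640 kB) with a 64M reserve (65536 kB), the old step is 65536 / 4 = 16384 kB, i.e. 16M, while the new step is 146800 kB, i.e. about 143M, matching the numbers in the message.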
2016-03-17  mm: cleanup *pte_alloc* interfaces  Kirill A. Shutemov
There are a few things about the *pte_alloc*() helpers worth cleaning up:

 - the 'vma' argument is unused, let's drop it;

 - most __pte_alloc() callers do a speculative check for pmd_none() before taking ptl: let's introduce a pte_alloc() macro which does the check.

   The only direct user of __pte_alloc() left is userfaultfd, which has different expectations about atomicity wrt pmd.

 - pte_alloc_map() and pte_alloc_map_lock() are redefined using pte_alloc().

[sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
[sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
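The "speculative check, then allocate under the lock" pattern that the new macro wraps can be illustrated outside the kernel. The following is a userspace analogue only, not kernel code: struct slot, slot_lock, slow_alloc() and slot_alloc() are hypothetical stand-ins for a pmd entry, its page table lock, __pte_alloc() and pte_alloc() respectively, and a pthread mutex replaces the kernel spinlock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pmd entry: NULL means "none". */
struct slot {
    void *table;
};

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Slow path: allocate first, then install under the lock, but only if
 * nobody else populated the slot in the meantime.  Returns 0 on
 * success and -1 if the allocation failed.
 */
static int slow_alloc(struct slot *s)
{
    void *new_table = calloc(1, 4096);

    if (!new_table)
        return -1;

    pthread_mutex_lock(&slot_lock);
    if (s->table == NULL) {     /* re-check now that we hold the lock */
        s->table = new_table;
        new_table = NULL;
    }
    pthread_mutex_unlock(&slot_lock);

    free(new_table);            /* no-op unless we lost the race */
    return 0;
}

/*
 * Fast path: the speculative, lock-free check.  Like the macro described
 * above, this evaluates to nonzero only when an allocation was needed
 * and failed.
 */
#define slot_alloc(s) ((s)->table == NULL && slow_alloc(s))
```

Folding the check into one macro means callers no longer open-code the emptiness test before calling the slow path, which is the cleanup the message describes.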
2016-03-17  virtio_balloon: export 'available' memory to balloon statistics  Igor Redko
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap.

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
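For comparison, the same figure can be read inside the guest from /proc/meminfo, where it appears on the "MemAvailable:" line. The sketch below is a userspace illustration only and does not touch the virtio_balloon protocol; the function names parse_mem_available_kb() and read_mem_available_kb() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan meminfo-style text line by line and return
 * the value of the "MemAvailable:" field in kB, or -1 if it is absent
 * or malformed.
 */
static long long parse_mem_available_kb(const char *text)
{
    const char *key = "MemAvailable:";
    size_t key_len = strlen(key);
    const char *p = text;

    while (p && *p) {
        if (strncmp(p, key, key_len) == 0) {
            long long kb;

            if (sscanf(p + key_len, "%lld", &kb) == 1)
                return kb;
            return -1;
        }
        p = strchr(p, '\n');    /* advance to the start of the next line */
        if (p)
            p++;
    }
    return -1;
}

/* Read /proc/meminfo and return MemAvailable in kB, or -1 on error. */
static long long read_mem_available_kb(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_mem_available_kb(buf);
}
```

A monitoring tool in the guest could call read_mem_available_kb() and compare the result with the value the hypervisor receives through the balloon statistics.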
2016-03-17  mm/page_alloc.c: calculate 'available' memory in a separate function  Igor Redko
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel.

In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>