summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2013-09-11mm/writeback: make writeback_inodes_wb staticWanpeng Li
It's not used globally and could be static. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: vmscan: fix do_try_to_free_pages() livelockLisa Du
This patch is based on KOSAKI's work and I add a little more description, please refer https://lkml.org/lkml/2012/6/14/74. Currently, I found system can enter a state that there are lots of free pages in a zone but only order-0 and order-1 pages which means the zone is heavily fragmented, then high order allocation could make direct reclaim path's long stall(ex, 60 seconds) especially in no swap and no compaciton enviroment. This problem happened on v3.4, but it seems issue still lives in current tree, the reason is do_try_to_free_pages enter live lock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced. As kswapd thinks there's little point trying all over again to avoid infinite loop. Instead it changes order from high-order to 0-order because kswapd think order-0 is the most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable =3D 0. It assume high-order users can still perform direct reclaim if they wish. Direct reclaim continue to reclaim for a high order which is not a COSTLY_ORDER without oom-killer until kswapd turn on zone->all_unreclaimble= . This is because to avoid too early oom-kill. So it means direct_reclaim depends on kswapd to break this loop. In worst case, direct-reclaim may continue to page reclaim forever when kswapd sleeps forever until someone like watchdog detect and finally kill the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from direct reclaim path because direct reclaim path don't take any lock and this way is racy. Thus this patch removes zone->all_unreclaimable field completely and recalculates zone reclaimable state every time. Note: we can't take the idea that direct-reclaim see zone->pages_scanned directly and kswapd continue to use zone->all_unreclaimable. Because, it is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the detail. [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <npiggin@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Neil Zhang <zhangwm@marvell.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: manual pte walk in fast path instead of follow_page_mask()Vlastimil Babka
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: track vma changes with VM_SOFTDIRTY bitCyrill Gorcunov
Pavel reported that in case if vma area get unmapped and then mapped (or expanded) in-place, the soft dirty tracker won't be able to recognize this situation since it works on pte level and ptes are get zapped on unmap, loosing soft dirty bit of course. So to resolve this situation we need to track actions on vma level, there VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded) we set this bit, and keep it here until application calls for clearing soft dirty bit. Thus when user space application track memory changes now it can detect if vma area is renewed. Reported-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11memblock, numa: binary search node idYinghai Lu
Current early_pfn_to_nid() on arch that support memblock go over memblock.memory one by one, so will take too many try near the end. We can use existing memblock_search to find the node id for given pfn, that could save some time on bigger system that have many entries memblock.memory array. Here are the timing differences for several machines. In each case with the patch less time was spent in __early_pfn_to_nid(). 3.11-rc5 with patch difference (%) -------- ---------- -------------- UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%) UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%) UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%) UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%) Time in seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Russ Anderson <rja@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: migrate: check movability of hugepage in unmap_and_move_huge_page()Naoya Horiguchi
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing,) so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do page table walk and check pud/pmd_huge() there, so they are safe. But the other users (softoffline and memory hotremove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture dependent check of whether hugepage are implemented on a pmd basis or not. And on some architecture multiple sizes of hugepages are available, so hugepage_migration_support() also checks hugepage size. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: memory-hotplug: enable memory hotplug to handle hugepageNaoya Horiguchi
Until now we can't offline memory blocks which contain hugepages because a hugepage is considered as an unmovable page. But now with this patch series, a hugepage has become movable, so by using hugepage migration we can offline such memory blocks. What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block intervene the memory offlining. For this reason we introduce new functions dissolve_free_huge_page() and dissolve_free_huge_pages(). Other than that, what this patch does is straightforwardly to add hugepage migration code, that is, adding hugepage code to the functions which scan over pfn and collect hugepages to be migrated, and adding a hugepage allocation function to alloc_migrate_target(). As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because it's larger than memory block. So we now simply leave it to fail as it is. [yongjun_wei@trendmicro.com.cn: remove duplicated include] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: migrate: remove VM_HUGETLB from vma flag check in vma_migratable()Naoya Horiguchi
Enable hugepage migration from migrate_pages(2), move_pages(2), and mbind(2). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: mbind: add hugepage migration code to mbind()Naoya Horiguchi
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: soft-offline: use migrate_pages() instead of migrate_huge_page()Naoya Horiguchi
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated as an argument, instead of taking a pointer to the list of hugepages to be migrated. This behavior was introduced in commit 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK because until now hugepage migration is enabled only for soft-offlining which migrates only one hugepage in a single call. But the situation will change in the later patches in this series which enable other users of page migration to support hugepage migration. They can kick migration for both of normal pages and hugepages in a single call, so we need to go back to original implementation which uses linked lists to collect the hugepages to be migrated. With this patch, soft_offline_huge_page() switches to use migrate_pages(), and migrate_huge_page() is not used any more. So let's remove it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: migrate: make core migration code aware of hugepageNaoya Horiguchi
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepage can enjoy the benefit of mempolicy and memory hotplug.) So this patchset tries to extend such users to support hugepage migration. The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug. This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, can have benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA.) Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing.) As for how these are achived, I extended the API (migrate_pages()) to handle hugepage (with patch 1 and 2) and adjusted code of each caller to check and collect movable hugepages (with patch 3-7). Remaining 2 patches are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing appropriate zone for hugepage allocation. My test is mainly functional one, simply kicking hugepage migration via each entry point and confirm that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git And I always run libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test. This patch (of 9): Before enabling each user of page migration to support hugepage, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both of LRU pages and hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/genalloc.c: fix overflow of ending address of memory chunkJoonyoung Shim
In struct gen_pool_chunk, end_addr means the end address of memory chunk (inclusive), but in the implementation it is treated as address + size of memory chunk (exclusive), so it points to the address plus one instead of correct ending address. The ending address of memory chunk plus one will cause overflow on the memory chunk including the last address of memory map, e.g. when starting address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending address will be 0x100000000. Use correct ending address like starting address + size - 1. [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11vmstat: create separate function to fold per cpu diffs into local countersChristoph Lameter
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: clean-up #ifdef in page_mapping()Joonsoo Kim
PageSwapCache() is always false when !CONFIG_SWAP, so compiler properly discard related code. Therefore, we don't need #ifdef explicitly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: page_alloc: fair zone allocator policyJohannes Weiner
Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able stay in memory long enough to get referenced twice and activated, but activation happens in spades: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 0 nr_active_file 8 nr_inactive_file 1582 nr_active_file 11994 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 70 nr_inactive_file 258753 nr_active_file 443214 nr_inactive_file 149793 nr_active_file 12021 Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim is now fairly spread out to all available/allowable zones: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 174 nr_active_file 4865 nr_inactive_file 53 nr_active_file 860 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 666622 nr_active_file 4988 nr_inactive_file 190969 nr_active_file 937 When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Cc: Zlatko Calusic <zcalusic@bitsync.net> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag ↵Srivatsa S. Bhat
tracepoint() In the current code, the value of fallback_migratetype that is printed using the mm_page_alloc_extfrag tracepoint, is the value of the migratetype *after* it has been set to the preferred migratetype (if the ownership was changed). Obviously that wouldn't have been the original intent. (We already have a separate 'change_ownership' field to tell whether the ownership of the pageblock was changed from the fallback_migratetype to the preferred type.) The intent of the fallback_migratetype field is to show the migratetype from which we borrowed pages in order to satisfy the allocation request. So fix the code to print that value correctly. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: make cluster allocation per-cpuShaohua Li
swap cluster allocation is to get better request merge to improve performance. But the cluster is shared globally, if multiple tasks are doing swap, this will cause interleave disk access. While multiple tasks swap is quite common, for example, each numa node has a kswapd thread doing swap and multiple threads/processes doing direct page reclaim. ioscheduler can't help too much here, because tasks don't send swapout IO down to block layer in the meantime. Block layer does merge some IOs, but a lot not, depending on how many tasks are doing swapout concurrently. In practice, I've seen a lot of small size IO in swapout workloads. We makes the cluster allocation per-cpu here. The interleave disk access issue goes away. All tasks swapout to their own cluster, so swapout will become sequential, which can be easily merged to big size IO. If one CPU can't get its per-cpu cluster (for example, there is no free cluster anymore in the swap), it will fallback to scan swap_map. The CPU can still continue swap. We don't need recycle free swap entries of other CPUs. In my test (swap to a 2-disk raid0 partition), this improves around 10% swapout throughput, and request size is increased significantly. How does this impact swap readahead is uncertain though. On one side, page reclaim always isolates and swaps several adjancent pages, this will make page reclaim write the pages sequentially and benefit readahead. On the other side, several CPU write pages interleave means the pages don't live _sequentially_ but relatively _near_. In the per-cpu allocation case, if adjancent pages are written by different cpus, they will live relatively _far_. So how this impacts swap readahead depends on how many pages page reclaim isolates and swaps one time. If the number is big, this patch will benefit swap readahead. Of course, this is about sequential access pattern. The patch has no impact for random access pattern, because the new cluster allocation algorithm is just for SSD. Alternative solution is organizing swap layout to be per-mm instead of this per-cpu approach. In the per-mm layout, we allocate a disk range for each mm, so pages of one mm live in swap disk adjacently. per-mm layout has potential issues of lock contention if multiple reclaimers are swap pages from one mm. For a sequential workload, per-mm layout is better to implement swap readahead, because pages from the mm are adjacent in disk. But per-cpu layout isn't very bad in this workload, as page reclaim always isolates and swaps several pages one time, such pages will still live in disk sequentially and readahead can utilize this. For a random workload, per-mm layout isn't beneficial of request merge, because it's quite possible pages from different mm are swapout in the meantime and IO can't be merged in per-mm layout. while with per-cpu layout we can merge requests from any mm. Considering random workload is more popular in workloads with swap (and per-cpu approach isn't too bad for sequential workload too), I'm choosing per-cpu layout. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: make swap discard asyncShaohua Li
swap can do cluster discard for SSD, which is good, but there are some problems here: 1. swap do the discard just before page reclaim gets a swap entry and writes the disk sectors. This is useless for high end SSD, because an overwrite to a sector implies a discard to original sector too. A discard + overwrite == overwrite. 2. the purpose of doing discard is to improve SSD firmware garbage collection. Idealy we should send discard as early as possible, so firmware can do something smart. Sending discard just after swap entry is freed is considered early compared to sending discard before write. Of course, if workload is already bound to gc speed, sending discard earlier or later doesn't make 3. block discard is a sync API, which will delay scan_swap_map() significantly. 4. Write and discard command can be executed parallel in PCIe SSD. Making swap discard async can make execution more efficiently. This patch makes swap discard async and moves discard to where swap entry is freed. Discard and write have no dependence now, so above issues can be avoided. Idealy we should do discard for any freed sectors, but some SSD discard is very slow. This patch still does discard for a whole cluster. My test does a several round of 'mmap, write, unmap', which will trigger a lot of swap discard. In a fusionio card, with this patch, the test runtime is reduced to 18% of the time without it, so around 5.5x faster. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: change block allocation algorithm for SSDShaohua Li
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30% CPU time (when cluster is hard to find, the CPU time can be up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte array to search a 256 page cluster, which is very slow. Here I introduced a simple algorithm to search cluster. Since we only care about 256 pages cluster, we can just use a counter to track if a cluster is free. Every 256 pages use one int to store the counter. If the counter of a cluster is 0, the cluster is free. All free clusters will be added to a list, so searching cluster is very efficient. With this, scap_swap_map() overhead disappears. This might help low end SD card swap too. Because if the cluster is aligned, SD firmware can do flash erase more efficiently. We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has downside with the algorithm which might introduce regression (see below). The patch slightly changes which cluster is choosen. It always adds free cluster to list tail. This can help wear leveling for low end SSD too. And if no cluster found, the scan_swap_map() will do search from the end of last cluster. So if no cluster found, the scan_swap_map() will do search from the end of last free cluster, which is random. For SSD, this isn't a problem at all. Another downside is the cluster must be aligned to 256 pages, which will reduce the chance to find a cluster. I would expect this isn't a big problem for SSD because of the non-seek penality. (And this is the reason I only enable the algorithm for SSD). Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: vmstats: track TLB flush stats on UP tooDave Hansen
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: vmstats: tlb flush countersDave Hansen
I was investigating some TLB flush scaling issues and realized that we do not have any good methods for figuring out how many TLB flushes we are doing. It would be nice to be able to do these in generic code, but the arch-independent calls don't explicitly specify whether we actually need to do remote flushes or not. In the end, we really need to know if we actually _did_ global vs. local invalidations, so that leaves us with few options other than to muck with the counters from arch-specific code. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: mempolicy: turn vma_set_policy() into vma_dup_policy()Oleg Nesterov
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11block: support embedded device command line partitionCai Zhiyong
Read block device partition table from command line. The partition used for fixed block device (eMMC) embedded device. It is no MBR, save storage space. Bootloader can be easily accessed by absolute address of data on the block device. Users can easily change the partition. This code reference MTD partition, source "drivers/mtd/cmdlinepart.c" About the partition verbose reference "Documentation/block/cmdline-partition.txt" [akpm@linux-foundation.org: fix printk text] [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11include/linux/sched.h: don't use task->pid/tgid in ↵Oleg Nesterov
same_thread_group/has_group_leader_pid task_struct->pid/tgid should go away. 1. Change same_thread_group() to use task->signal for comparison. 2. Change has_group_leader_pid(task) to compare task_pid(task) with signal->leader_pid. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11include/linux/smp.h:on_each_cpu(): switch back to a C functionAndrew Morton
Revert commit c846ef7deba2 ("include/linux/smp.h:on_each_cpu(): switch back to a macro"). It turns out that the problematic linux/irqflags.h include was fixed within ia64 and mn10300. Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11Merge tag 'for-v3.12' of git://git.infradead.org/battery-2.6Linus Torvalds
Pull battery/power supply driver updates from Anton Vorontsov: "New drivers: - APM X-Gene system reboot driver by Feng Kan and Loc Ho (APM). - Qualcomm MSM reboot/poweroff driver by Abhimanyu Kapur (Codeaurora). - Texas Instruments BQ24190 charger driver by Mark A. Greer (Animal Creek Technologies). - Texas Instruments TWL4030 MADC battery driver by Lukas Märdian and Marek Belisko (Golden Delicious Computers). The driver is used on Freerunner GTA04 phones. Highlighted fixes and improvements: - Suspend/wakeup logic improvements: power supply objects will block system suspend until all power supply events are processed. Thanks to Zoran Markovic (Linaro), Arve Hjonnevag and Todd Poynor (Google)" * tag 'for-v3.12' of git://git.infradead.org/battery-2.6: rx51_battery: Fix channel number when reading adc value power: Add twl4030_madc battery driver. bq24190_charger: Workaround SS definition problem on i386 builds power_supply: Prevent suspend until power supply events are processed vexpress-poweroff: Should depend on the required infrastructure twl4030-charger: Fix compiler warning with regulator_enable() rx51_battery: Replace hardcoded channels values. bq24190_charger: Add support for TI BQ24190 Battery Charger ab8500-charger: We print an unintended error message max8925_power: Fix missing of_node_put power_supply: Replace strict_strtol() with kstrtol() power: Add APM X-Gene system reboot driver power_supply: tosa_battery: Get rid of irq_to_gpio usage power supply: collie_battery: Convert to use dev_pm_ops power_supply: Make goldfish_battery depend on GOLDFISH || COMPILE_TEST power: reset: Add msm restart support MAINTAINERS: drivers/power: add entry for SmartReflex AVS drivers
2013-09-11Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "Daniel had some fixes queued up, that were delayed, the stolen memory ones and vga arbiter ones are quite useful, along with his usual bunch of stuff, nothing for HSW outputs yet. The one nouveau fix is for a regression I caused with the poweroff stuff" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (30 commits) drm/nouveau: fix oops on runtime suspend/resume drm/i915: Delay disabling of VGA memory until vgacon->fbcon handoff is done drm/i915: try not to lose backlight CBLV precision drm/i915: Confine page flips to BCS on Valleyview drm/i915: Skip stolen region initialisation if none is reserved drm/i915: fix gpu hang vs. flip stall deadlocks drm/i915: Hold an object reference whilst we shrink it drm/i915: fix i9xx_crtc_clock_get for multiplied pixels drm/i915: handle sdvo input pixel multiplier correctly again drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock drm/i915: fix up the relocate_entry refactoring drm/i915: Fix pipe config warnings when dealing with LVDS fixed mode drm/i915: Don't call sg_free_table() if sg_alloc_table() fails i915: Update VGA arbiter support for newer devices vgaarb: Fix VGA decodes changes vgaarb: Don't disable resources that are not owned drm/i915: Pin pages whilst mapping the dma-buf drm/i915: enable trickle feed on Haswell x86: add early quirk for reserving Intel graphics stolen memory v5 drm/i915: split PCI IDs out into i915_drm.h v4 ...
2013-09-11Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "This was a very quiet cycle! Just a few bugfixes and some cleanup" * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: rpc: let xdr layer allocate gssproxy receieve pages rpc: fix huge kmalloc's in gss-proxy rpc: comment on linux_cred encoding, treat all as unsigned rpc: clean up decoding of gssproxy linux creds svcrpc: remove unused rq_resused nfsd4: nfsd4_create_clid_dir prints uninitialized data nfsd4: fix leak of inode reference on delegation failure Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR" sunrpc: prepare NFS for 2038 nfsd4: fix setlease error return nfsd: nfs4_file_get_access: need to be more careful with O_RDWR
2013-09-10Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linuxLinus Torvalds
Pull device tree core updates from Grant Likely: "Generally minor changes. A bunch of bug fixes, particularly for initialization and some refactoring. Most notable change if feeding the entire flattened tree into the random pool at boot. May not be significant, but shouldn't hurt either" Tim Bird questions whether the boot time cost of the random feeding may be noticeable. And "add_device_randomness()" is definitely not some speed deamon of a function. * tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux: of/platform: add error reporting to of_amba_device_create() irq/of: Fix comment typo for irq_of_parse_and_map of: Feed entire flattened device tree into the random pool of/fdt: Clean up casting in unflattening path of/fdt: Remove duplicate memory clearing on FDT unflattening gpio: implement gpio-ranges binding document fix of: call __of_parse_phandle_with_args from of_parse_phandle of: introduce of_parse_phandle_with_fixed_args of: move of_parse_phandle() of: move documentation of of_parse_phandle_with_args of: Fix missing memory initialization on FDT unflattening of: consolidate definition of early_init_dt_alloc_memory_arch() of: Make of_get_phy_mode() return int i.s.o. const int include: dt-binding: input: create a DT header defining key codes. of/platform: Staticize of_platform_device_create_pdata() of: Specify initrd location using 64-bit dt: Typo fix OF: make of_property_for_each_{u32|string}() use parameters if OF is not enabled
2013-09-10Merge branch 'for-linus' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds
Pull slave-dmaengine updates from Vinod Koul: "This pull brings: - Andy's DW driver updates - Guennadi's sh driver updates - Pl08x driver fixes from Tomasz & Alban - Improvements to mmp_pdma by Daniel - TI EDMA fixes by Joel - New drivers: - Hisilicon k3dma driver - Renesas rcar dma driver - New API for publishing slave driver capablities - Various fixes across the subsystem by Andy, Jingoo, Sachin etc..." * 'for-linus' of git://git.infradead.org/users/vkoul/slave-dma: (94 commits) dma: edma: Remove limits on number of slots dma: edma: Leave linked to Null slot instead of DUMMY slot dma: edma: Find missed events and issue them ARM: edma: Add function to manually trigger an EDMA channel dma: edma: Write out and handle MAX_NR_SG at a given time dma: edma: Setup parameters to DMA MAX_NR_SG at a time dmaengine: pl330: use dma_set_max_seg_size to set the sg limit dmaengine: dma_slave_caps: remove sg entries dma: replace devm_request_and_ioremap by devm_ioremap_resource dma: ste_dma40: Fix potential null pointer dereference dma: ste_dma40: Remove duplicate const dma: imx-dma: Remove redundant NULL check dma: dmagengine: fix function names in comments dma: add driver for R-Car HPB-DMAC dma: k3dma: use devm_ioremap_resource() instead of devm_request_and_ioremap() dma: imx-sdma: Staticize sdma_driver_data structures pch_dma: Add MODULE_DEVICE_TABLE dmaengine: PL08x: Add cyclic transfer support dmaengine: PL08x: Fix reading the byte count in cctl dmaengine: PL08x: Add support for different maximum transfer size ...
2013-09-10Merge tag 'mmc-updates-for-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc Pull MMC updates from Chris Ball: "MMC highlights for 3.12: Core: - Support Allocation Units 8MB-64MB in SD3.0, previous max was 4MB. - The slot-gpio helper can now handle GPIO debouncing card-detect. - Read supported voltages from DT "voltage-ranges" property. Drivers: - dw_mmc: Add support for ARC architecture, and support exynos5420. - mmc_spi: Support CD/RO GPIOs. - sh_mobile_sdhi: Add compatibility for more Renesas SoCs. - sh_mmcif: Add DT support for DMA channels" * tag 'mmc-updates-for-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (50 commits) Revert "mmc: tmio-mmc: Remove .set_pwr() callback from platform data" mmc: dw_mmc: Add support for ARC mmc: sdhci-s3c: initialize host->quirks2 for using quirks2 mmc: sdhci-s3c: fix the wrong register value, when clock is disabled mmc: esdhc: add support to get voltage from device-tree mmc: sdhci: get voltage from sdhc host mmc: core: parse voltage from device-tree mmc: omap_hsmmc: use the generic config for omap2plus devices mmc: omap_hsmmc: clear status flags before starting a new command mmc: dw_mmc: exynos: Add a new compatible string for exynos5420 mmc: sh_mmcif: revision-specific CLK_CTRL2 handling mmc: sh_mmcif: revision-specific Command Completion Signal handling mmc: sh_mmcif: add support for Device Tree DMA bindings mmc: sh_mmcif: move header include from header into .c mmc: SDHI: add DT compatibility strings for further SoCs mmc: dw_mmc-pci: enable bus-mastering mode mmc: dw_mmc-pci: get resources from a proper BAR mmc: tmio-mmc: Remove .set_pwr() callback from platform data mmc: tmio-mmc: Remove .get_cd() callback from platform data mmc: sh_mobile_sdhi: Remove .set_pwr() callback from platform data ...
2013-09-10Merge tag 'dm-3.12-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper updates from Mike Snitzer: "Add the ability to collect I/O statistics on user-defined regions of a device-mapper device. This dm-stats code required the reintroduction of a div64_u64_rem() helper, but as a separate method that doesn't slow down div64_u64() -- especially on 32-bit systems. Allow the error target to replace request-based DM devices (e.g. multipath) in addition to bio-based DM devices. Various other small code fixes and improvements to thin-provisioning, DM cache and the DM ioctl interface" * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm stripe: silence a couple sparse warnings dm: add statistics support dm thin: always return -ENOSPC if no_free_space is set dm ioctl: cleanup error handling in table_load dm ioctl: increase granularity of type_lock when loading table dm ioctl: prevent rename to empty name or uuid dm thin: set pool read-only if breaking_sharing fails block allocation dm thin: prefix pool error messages with pool device name dm: allow error target to replace bio-based and request-based targets math64: New separate div64_u64_rem helper dm space map: optimise sm_ll_dec and sm_ll_inc dm btree: prefetch child nodes when walking tree for a dm_btree_del dm btree: use pop_frame in dm_btree_del to cleanup code dm cache: eliminate holes in cache structure dm cache: fix stacking of geometry limits dm thin: fix stacking of geometry limits dm thin: add data block size limits to Documentation dm cache: add data block size limits to code and Documentation dm cache: document metadata device is exclussive to a cache dm: stop using WQ_NON_REENTRANT
2013-09-10Merge tag 'md/3.12' of git://neil.brown.name/mdLinus Torvalds
Pull md update from Neil Brown: "Headline item is multithreading for RAID5 so that more IO/sec can be supported on fast (SSD) devices. Also TILE-Gx SIMD suppor for RAID6 calculations and an assortment of bug fixes" * tag 'md/3.12' of git://neil.brown.name/md: raid5: only wakeup necessary threads md/raid5: flush out all pending requests before proceeding with reshape. md/raid5: use seqcount to protect access to shape in make_request. raid5: sysfs entry to control worker thread number raid5: offload stripe handle to workqueue raid5: fix stripe release order raid5: make release_stripe lockless md: avoid deadlock when dirty buffers during md_stop. md: Don't test all of mddev->flags at once. md: Fix apparent cut-and-paste error in super_90_validate raid6/test: replace echo -e with printf RAID: add tilegx SIMD implementation of raid6 md: fix safe_mode buglet. md: don't call md_allow_write in get_bitmap_file.
2013-09-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks
2013-09-10Merge tag 'drm-intel-fixes-2013-09-06' of ↵Dave Airlie
git://people.freedesktop.org/~danvet/drm-intel into drm-fixes - Early stolen mem reservation from Jesse in x86 boot code. Acked by Ingo and hpa. This was ready much earlier but somehow I've thought it'd go in through x86 trees, hence why this is late. Avoids the pci resource code to plant mmiobars in the middle of stolen mem and other ugliness. - vgaarb improvements from Alex Williamson plus the fix from Ville for the vgacon->fbcon smooth transition "feature". - Render pageflips on ivb/hsw to avoid stalls due to the ring switching when only flipping on the blitter (Chris). - Deadlock fixes around our flush_workqueue which crept back in - lockdep isn't clever enough :( - Shrinker recursion fix from Chris - this is the thing that blew the vma patches from Ben I've taken out of 3.12. - Fixup for the relocation refactoring. Also an igt testcase to make sure we don't break this again. - Pile of smaller fixups all over, shortlog has full details. * tag 'drm-intel-fixes-2013-09-06' of git://people.freedesktop.org/~danvet/drm-intel: (29 commits) drm/i915: Delay disabling of VGA memory until vgacon->fbcon handoff is done drm/i915: try not to lose backlight CBLV precision drm/i915: Confine page flips to BCS on Valleyview drm/i915: Skip stolen region initialisation if none is reserved drm/i915: fix gpu hang vs. flip stall deadlocks drm/i915: Hold an object reference whilst we shrink it drm/i915: fix i9xx_crtc_clock_get for multiplied pixels drm/i915: handle sdvo input pixel multiplier correctly again drm/i915: fix hpd work vs. flush_work in the pageflip code deadlock drm/i915: fix up the relocate_entry refactoring drm/i915: Fix pipe config warnings when dealing with LVDS fixed mode drm/i915: Don't call sg_free_table() if sg_alloc_table() fails i915: Update VGA arbiter support for newer devices vgaarb: Fix VGA decodes changes vgaarb: Don't disable resources that are not owned drm/i915: Pin pages whilst mapping the dma-buf drm/i915: enable trickle feed on Haswell x86: add early quirk for reserving Intel graphics stolen memory v5 drm/i915: split PCI IDs out into i915_drm.h v4 i915_gem: Convert kmem_cache_alloc(...GFP_ZERO) to kmem_cache_zalloc ...
2013-09-10Merge tag 'dmaengine-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine update from Dan Williams: "Collection of random updates to the core and some end-driver fixups for ioatdma and mv_xor: - NUMA aware channel allocation - Cleanup dmatest debugfs interface - ioat: make raid-support Atom only - mv_xor: big endian Aside from the top three commits these have all had some soak time in -next. The top commit fixes a recent build breakage. It has been a long while since my last pull request, hopefully it does not show. Thanks to Vinod for keeping an eye on drivers/dma/ this past year" * tag 'dmaengine-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: dmaengine: dma_sync_wait and dma_find_channel undefined MAINTAINERS: update email for Dan Williams dma: mv_xor: Fix incorrect error path ioatdma: silence GCC warnings dmaengine: make dma_channel_rebalance() NUMA aware dmaengine: make dma_submit_error() return an error code ioatdma: disable RAID on non-Atom platforms and reenable unaligned copies mv_xor: support big endian systems using descriptor swap feature mv_xor: use {readl, writel}_relaxed instead of __raw_{readl, writel} dmatest: print message on debug level in case of no error dmatest: remove IS_ERR_OR_NULL checks of debugfs calls dmatest: make module parameters writable
2013-09-10dmaengine: dma_sync_wait and dma_find_channel undefinedJon Mason
dma_sync_wait and dma_find_channel are declared regardless of whether CONFIG_DMA_ENGINE is enabled, but calling the function without CONFIG_DMA_ENGINE enabled results "undefined reference" errors. To get around this, declare dma_sync_wait and dma_find_channel as inline functions if CONFIG_DMA_ENGINE is undefined. Signed-off-by: Jon Mason <jon.mason@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2013-09-09Merge tag 'late-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC late changes from Kevin Hilman: "These are changes that arrived a little late before the merge window, or had dependencies on previous branches. Highlights: - ux500: misc. cleanup, fixup I2C devices - exynos: DT updates for RTC; PM updates - at91: DT updates for NAND; new platforms added to generic defconfig - sunxi: DT updates: cubieboard2, pinctrl driver, gated clocks - highbank: LPAE fixes, select necessary ARM errata - omap: PM fixes and improvements; OMAP5 mailbox support - omap: basic support for new DRA7xx SoCs" * tag 'late-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (60 commits) ARM: dts: vexpress: Add CCI node to TC2 device-tree ARM: EXYNOS: Skip C1 cpuidle state for exynos5440 ARM: EXYNOS: always enable PM domains support for EXYNOS4X12 ARM: highbank: clean-up some unused includes ARM: sun7i: Enable the A20 clocks in the DTSI ARM: sun6i: Enable clock support in the DTSI ARM: sun5i: dt: Use the A10s gates in the DTSI ARM: at91: at91_dt_defconfig: enable rm9200 support ARM: dts: add ADC device tree node for exynos5420/5250 ARM: dts: Add RTC DT node to Exynos5420 SoC ARM: dts: Update the "status" property of RTC DT node for Exynos5250 SoC ARM: dts: Fix the RTC DT node name for Exynos5250 irqchip: mmp: avoid to include irqs head file ARM: mmp: avoid to include head file in mach-mmp irqchip: mmp: support irqchip irqchip: move mmp irq driver ARM: OMAP: AM33xx: clock: Add RNG clock data ARM: OMAP: TI81XX: add always-on powerdomain for TI81XX ARM: OMAP4: clock: Lock PLLs in the right sequence ARM: OMAP: AM33XX: hwmod: Add hwmod data for debugSS ...
2013-09-09Merge tag 'drivers-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC driver update from Kevin Hilman: "This contains the ARM SoC related driver updates for v3.12. The only thing this cycle are core PM updates and CPUidle support for ARM's TC2 big.LITTLE development platform" * tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: cpuidle: big.LITTLE: vexpress-TC2 CPU idle driver ARM: vexpress: tc2: disable GIC CPU IF in tc2_pm_suspend drivers: irq-chip: irq-gic: introduce gic_cpu_if_down()
2013-09-09Merge tag 'clk-for-linus-3.12' of git://git.linaro.org/people/mturquette/linuxLinus Torvalds
Pull clock framework changes from Michael Turquette: "The common clk framework changes for 3.12 are dominated by clock driver patches, both new drivers and fixes to existing. A high percentage of these are for Samsung platforms like Exynos. Core framework fixes and some new features like automagical clock re-parenting round out the patches" * tag 'clk-for-linus-3.12' of git://git.linaro.org/people/mturquette/linux: (102 commits) clk: only call get_parent if there is one clk: samsung: exynos5250: Simplify registration of PLL rate tables clk: samsung: exynos4: Register PLL rate tables for Exynos4x12 clk: samsung: exynos4: Register PLL rate tables for Exynos4210 clk: samsung: exynos4: Reorder registration of mout_vpllsrc clk: samsung: pll: Add support for rate configuration of PLL46xx clk: samsung: pll: Use new registration method for PLL46xx clk: samsung: pll: Add support for rate configuration of PLL45xx clk: samsung: pll: Use new registration method for PLL45xx clk: samsung: exynos4: Rename exynos4_plls to exynos4x12_plls clk: samsung: exynos4: Remove checks for DT node clk: samsung: exynos4: Remove unused static clkdev aliases clk: samsung: Modify _get_rate() helper to use __clk_lookup() clk: samsung: exynos4: Use separate aliases for cpufreq related clocks clocksource: samsung_pwm_timer: Get clock from device tree ARM: dts: exynos4: Specify PWM clocks in PWM node pwm: samsung: Update DT bindings documentation to cover clocks clk: Move symbol export to proper location clk: fix new_parent dereference before null check clk: wm831x: Initialise wm831x pointer on init ...
2013-09-09Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs updates from Ben Myers: "For 3.12-rc1 there are a number of bugfixes in addition to work to ease usage of shared code between libxfs and the kernel, the rest of the work to enable project and group quotas to be used simultaneously, performance optimisations in the log and the CIL, directory entry file type support, fixes for log space reservations, some spelling/grammar cleanups, and the addition of user namespace support. - introduce readahead to log recovery - add directory entry file type support - fix a number of spelling errors in comments - introduce new Q_XGETQSTATV quotactl for project quotas - add USER_NS support - log space reservation rework - CIL optimisations - kernel/userspace libxfs rework" * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits) xfs: XFS_MOUNT_QUOTA_ALL needed by userspace xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino Fix wrong flag ASSERT in xfs_attr_shortform_getvalue xfs: finish removing IOP_* macros. xfs: inode log reservations are too small xfs: check correct status variable for xfs_inobt_get_rec() call xfs: inode buffers may not be valid during recovery readahead xfs: check LSN ordering for v5 superblocks during recovery xfs: btree block LSN escaping to disk uninitialised XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 xfs: fix bad dquot buffer size in log recovery readahead xfs: don't account buffer cancellation during log recovery readahead xfs: check for underflow in xfs_iformat_fork() xfs: xfs_dir3_sfe_put_ino can be static xfs: introduce object readahead to log recovery xfs: Simplify xfs_ail_min() with list_first_entry_or_null() xfs: Register hotcpu notifier after initialization xfs: add xfs sb v4 support for dirent filetype field xfs: Add write support for dirent filetype field xfs: Add read-only support for dirent filetype field ...
2013-09-09Merge tag 'for-linus-20130909' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull mtd updates from David Woodhouse: - factor out common code from MTD tests - nand-gpio cleanup and portability to non-ARM - m25p80 support for 4-byte addressing chips, other new chips - pxa3xx cleanup and support for new platforms - remove obsolete alauda, octagon-5066 drivers - erase/write support for bcm47xxsflash - improve detection of ECC requirements for NAND, controller setup - NFC acceleration support for atmel-nand, read/write via SRAM - etc * tag 'for-linus-20130909' of git://git.infradead.org/linux-mtd: (184 commits) mtd: chips: Add support for PMC SPI Flash chips in m25p80.c mtd: ofpart: use for_each_child_of_node() macro mtd: mtdswap: replace strict_strtoul() with kstrtoul() mtd cs553x_nand: use kzalloc() instead of memset mtd: atmel_nand: fix error return code in atmel_nand_probe() mtd: bcm47xxsflash: writing support mtd: bcm47xxsflash: implement erasing support mtd: bcm47xxsflash: convert to module_platform_driver instead of init/exit mtd: bcm47xxsflash: convert kzalloc to avoid invalid access mtd: remove alauda driver mtd: nand: mxc_nand: mark 'const' properly mtd: maps: cfi_flagadm: add missing __iomem annotation mtd: spear_smi: add missing __iomem annotation mtd: r852: Staticize local symbols mtd: nandsim: Staticize local symbols mtd: impa7: add missing __iomem annotation mtd: sm_ftl: Staticize local symbols mtd: m25p80: add support for mr25h10 mtd: m25p80: make CONFIG_M25PXX_USE_FAST_READ safe to enable mtd: m25p80: Pass flags through CAT25_INFO macro ...
2013-09-09Merge branch 'for-v3.12' of ↵Linus Torvalds
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull DMA mapping update from Marek Szyprowski: "This contains an addition of Device Tree support for reserved memory regions (Contiguous Memory Allocator is one of the drivers for it) and changes required by the KVM extensions for PowerPC architectue" * 'for-v3.12' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: ARM: init: add support for reserved memory defined by device tree drivers: of: add initialization code for dma reserved memory drivers: of: add function to scan fdt nodes given by path drivers: dma-contiguous: clean source code and prepare for device tree
2013-09-09Merge tag 'vfio-v3.12-rc0' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO update from Alex Williamson: "VFIO updates include safer default file flags for VFIO device fds, an external user interface exported to allow other modules to hold references to VFIO groups, a fix to test for extended config space on PCIe and PCI-x, and new hot reset interfaces for PCI devices which allows the user to do PCI bus/slot resets when all of the devices affected by the reset are owned by the user. For this last feature, the PCI bus reset interface, I depend on changes already merged from Bjorn's PCI pull request. I therefore merged my tree up to commit cb3e433, which I think was the correct action, but as Stephen Rothwell noted, I failed to provide a commit message indicating why the merge was required. Sorry for that. Thanks, Alex" * tag 'vfio-v3.12-rc0' of git://github.com/awilliam/linux-vfio: vfio: fix documentation vfio-pci: PCI hot reset interface vfio-pci: Test for extended config space vfio-pci: Use fdget() rather than eventfd_fget() vfio: Add O_CLOEXEC flag to vfio device fd vfio: use get_unused_fd_flags(0) instead of get_unused_fd() vfio: add external user support
2013-09-09Merge tag 'nfs-for-3.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: - Fix NFSv4 recovery so that it doesn't recover lost locks in cases such as lease loss due to a network partition, where doing so may result in data corruption. Add a kernel parameter to control choice of legacy behaviour or not. - Performance improvements when 2 processes are writing to the same file. - Flush data to disk when an RPCSEC_GSS session timeout is imminent. - Implement NFSv4.1 SP4_MACH_CRED state protection to prevent other NFS clients from being able to manipulate our lease and file locking state. - Allow sharing of RPCSEC_GSS caches between different rpc clients. - Fix the broken NFSv4 security auto-negotiation between client and server. - Fix rmdir() to wait for outstanding sillyrename unlinks to complete - Add a tracepoint framework for debugging NFSv4 state recovery issues. - Add tracing to the generic NFS layer. - Add tracing for the SUNRPC socket connection state. - Clean up the rpc_pipefs mount/umount event management. - Merge more patches from Chuck in preparation for NFSv4 migration support" * tag 'nfs-for-3.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (107 commits) NFSv4: use mach cred for SECINFO_NO_NAME w/ integrity NFS: nfs_compare_super shouldn't check the auth flavour unless 'sec=' was set NFSv4: Allow security autonegotiation for submounts NFSv4: Disallow security negotiation for lookups when 'sec=' is specified NFSv4: Fix security auto-negotiation NFS: Clean up nfs_parse_security_flavors() NFS: Clean up the auth flavour array mess NFSv4.1 Use MDS auth flavor for data server connection NFS: Don't check lock owner compatability unless file is locked (part 2) NFS: Don't check lock owner compatibility in writes unless file is locked nfs4: Map NFS4ERR_WRONG_CRED to EPERM nfs4.1: Add SP4_MACH_CRED write and commit support nfs4.1: Add SP4_MACH_CRED stateid support nfs4.1: Add SP4_MACH_CRED secinfo support nfs4.1: Add SP4_MACH_CRED cleanup support nfs4.1: Add state protection handler nfs4.1: Minimal SP4_MACH_CRED implementation SUNRPC: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid SUNRPC: Add an identifier for struct rpc_clnt SUNRPC: Ensure rpc_task->tk_pid is available for tracepoints ...
2013-09-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "This includes both the first pile of Ceph patches (which I sent to torvalds@vger, sigh) and a few new patches that add support for fscache for Ceph. That includes a few fscache core fixes that David Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski for putting this feature together) This first batch of patches (included here) had (has) several important RBD bug fixes, hole punch support, several different cleanups in the page cache interactions, improvements in the truncate code (new truncate mutex to avoid shenanigans with i_mutex), and a series of fixes in the synchronous striping read/write code. On top of that is a random collection of small fixes all across the tree (error code checks and error path cleanup, obsolete wq flags, etc)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits) ceph: use d_invalidate() to invalidate aliases ceph: remove ceph_lookup_inode() ceph: trivial buildbot warnings fix ceph: Do not do invalidate if the filesystem is mounted nofsc ceph: page still marked private_2 ceph: ceph_readpage_to_fscache didn't check if marked ceph: clean PgPrivate2 on returning from readpages ceph: use fscache as a local presisent cache fscache: Netfs function for cleanup post readpages FS-Cache: Fix heading in documentation CacheFiles: Implement interface to check cache consistency FS-Cache: Add interface to check consistency of a cached object rbd: fix null dereference in dout rbd: fix buffer size for writes to images with snapshots libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc rbd: fix I/O error propagation for reads ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem ceph: allow sync_read/write return partial successed size of read/write. ceph: fix bugs about handling short-read for sync read mode. ceph: remove useless variable revoked_rdcache ...
2013-09-09introduce kern_path_mountpoint()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-09rename user_path_umountat() to user_path_mountpoint_at()Al Viro
... and move the extern from linux/namei.h to fs/internal.h, along with that of vfs_path_lookup(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-08vfs: reorganize dput() memory accessesLinus Torvalds
This is me being a bit OCD after all the dentry optimization work this merge window: profiles end up showing 'dput()' as a rather expensive operation, and there were two unrelated bad reasons for that. The first reason was reading d_lockref.count for debugging purposes, which touches the lockref cacheline (for reads) before really need to. More importantly, the debugging test in question is _wrong_, and has hidden bugs. It's true that we can only sleep when the count goes down to zero, but the test as-is hides the much more subtle bug that happens if we race with somebody else deleting the file. Anyway we _will_ touch that cacheline, but let's do it for a write and in the right routine (ie in "lockref_put_or_lock()") which annotates the costs better. So remove the misleading debug code. The other was an unnecessary access to the cacheline that contains the d_lru list, just to check whether we already were on the LRU list or not. This is exactly what we have d_flags for, so that we can avoid touching extra cache lines for the common case. So just add another bit for "is this dentry on the LRU". Finally, mark the tests properly likely/unlikely, so that the common fast-paths are dense in the instruction stream. This makes the profiles look much saner. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-08Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds
Pull NVM Express driver update from Matthew Wilcox. * git://git.infradead.org/users/willy/linux-nvme: NVMe: Merge issue on character device bring-up NVMe: Handle ioremap failure NVMe: Add pci suspend/resume driver callbacks NVMe: Use normal shutdown NVMe: Separate controller init from disk discovery NVMe: Separate queue alloc/free from create/delete NVMe: Group pci related actions in functions NVMe: Disk stats for read/write commands only NVMe: Bring up cdev on set feature failure NVMe: Fix checkpatch issues NVMe: Namespace IDs are unsigned NVMe: Update nvme_id_power_state with latest spec NVMe: Split header file into user-visible and kernel-visible pieces NVMe: Call nvme_process_cq from submission path NVMe: Remove "process_cq did something" message NVMe: Return correct value from interrupt handler NVMe: Disk IO statistics NVMe: Restructure MSI / MSI-X setup NVMe: Use kzalloc instead of kmalloc+memset