From f0281a00fe80f0e689dd51e68c3aed5f6ef1bf58 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Fri, 20 May 2016 16:56:25 -0700 Subject: mm: workingset: only do workingset activations on reads This is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html where Andres reported his database workingset being pushed out by the minimum size enforcement of the inactive file list - currently 50% of cache - as well as repeatedly written file pages that are never actually read. Two changes fell out of the discussions. The first change observes that pages that are only ever written don't benefit from caching beyond what the writeback cache does for partial page writes, and so we shouldn't promote them to the active file list where they compete with pages whose cached data is actually accessed repeatedly. This change comes in two patches - one for in-cache write accesses and one for refaults triggered by writes, neither of which should promote a cache page. Second, with the refault detection we don't need to set 50% of the cache aside for used-once cache anymore since we can detect frequently used pages even when they are evicted between accesses. We can allow the active list to be bigger and thus protect a bigger workingset that isn't challenged by streamers. Depending on the access patterns, this can increase major faults during workingset transitions for better performance during stable phases. This patch (of 3): When rewriting a page, the data in that page is replaced with new data. This means that evicting something else from the active file list, in order to cache data that will be replaced by something else, is likely to be a waste of memory. It is better to save the active list for frequently read pages, because reads actually use the data that is in the page. This patch ignores partial writes, because it is unclear whether the complexity of identifying those is worth any potential performance gain obtained from better caching pages that see repeated partial writes at large enough intervals to not get caught by the use-twice promotion code used for the inactive file list. Signed-off-by: Rik van Riel Signed-off-by: Johannes Weiner Reported-by: Andres Freund Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/filemap.c b/mm/filemap.c index 0169033..beba6bd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -713,8 +713,12 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, * The page might have been evicted from cache only * recently, in which case it should be activated like * any other repeatedly accessed page. + * The exception is pages getting rewritten; evicting other + * data from the working set, only to cache data that will + * get overwritten with something else, is a waste of memory. */ - if (shadow && workingset_refault(shadow)) { + if (!(gfp_mask & __GFP_WRITE) && + shadow && workingset_refault(shadow)) { SetPageActive(page); workingset_activation(page); } else -- cgit v0.10.2 From bbddabe2e436aa7869b3ac5248df5c14ddde0cbf Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 20 May 2016 16:56:28 -0700 Subject: mm: filemap: only do access activations on reads Andres observed that his database workload is struggling with the transaction journal creating pressure on frequently read pages. Access patterns like transaction journals frequently write the same pages over and over, but in the majority of cases those pages are never read back. 
There are no caching benefits to be had for those pages, so activating them and having them put pressure on pages that do benefit from caching is a bad choice. Leave page activations to read accesses and don't promote pages based on writes alone. It could be said that partially written pages do contain cache-worthy data, because even if *userspace* does not access the unwritten part, the kernel still has to read it from the filesystem for correctness. However, a counter argument is that these pages enjoy at least *some* protection over other inactive file pages through the writeback cache, in the sense that dirty pages are written back with a delay and cache reclaim leaves them alone until they have been written back to disk. Should that turn out to be insufficient and we see increased read IO from partial writes under memory pressure, we can always go back and update grab_cache_page_write_begin() to take (pos, len) so that it can tell partial writes from pages that don't need partial reads. But for now, keep it simple. Signed-off-by: Johannes Weiner Reported-by: Andres Freund Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/filemap.c b/mm/filemap.c index beba6bd..8f48599 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2578,7 +2578,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags) { struct page *page; - int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT; + int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT; if (flags & AOP_FLAG_NOFS) fgp_flags |= FGP_NOFS; -- cgit v0.10.2 From 59dc76b0d4dfdd7dc46a1010e4afb44f60f3e97f Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Fri, 20 May 2016 16:56:31 -0700 Subject: mm: vmscan: reduce size of inactive file list The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. 
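To make the sizing heuristic concrete, here is a minimal userspace sketch (assumptions: plain sqrt() stands in for the kernel's int_sqrt(), and the helper/main names are illustrative, not kernel code) that reproduces the active:inactive ratio table quoted in the diffs below, which this patch reuses for the file LRU:

/* build with: cc ratio-demo.c -lm */
#include <stdio.h>
#include <math.h>

/* target ratio = int_sqrt(10 * size_in_GB), clamped to at least 1 */
static unsigned long inactive_ratio(unsigned long total_gb)
{
        unsigned long ratio = (unsigned long)sqrt(10.0 * total_gb);

        return ratio ? ratio : 1;
}

int main(void)
{
        /* sizes taken from the ratio table quoted in the diff below */
        static const unsigned long sizes_gb[] = { 1, 10, 100, 1024, 10240 };
        unsigned int i;

        for (i = 0; i < sizeof(sizes_gb) / sizeof(sizes_gb[0]); i++) {
                unsigned long gb = sizes_gb[i];
                unsigned long ratio = inactive_ratio(gb);

                /* e.g. 1GB -> ratio 3, i.e. at most ~25% kept inactive */
                printf("%6luGB: ratio %3lu, max inactive ~%.2fGB\n",
                       gb, ratio, (double)gb / (ratio + 1));
        }
        return 0;
}

The output matches the table in the removed page_alloc.c comment (1GB -> 3/~250MB, 1TB -> 101/~10GB, 10TB -> 320/~32GB), which is the behaviour now applied to the file LRU as well.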
[hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Rik van Riel Signed-off-by: Johannes Weiner Reported-by: Andres Freund Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 94da967..a805474 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -415,25 +415,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) return mz->lru_size[lru]; } -static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) -{ - unsigned long inactive_ratio; - unsigned long inactive; - unsigned long active; - unsigned long gb; - - inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON); - active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON); - - gb = (inactive + active) >> (30 - PAGE_SHIFT); - if (gb) - inactive_ratio = int_sqrt(10 * gb); - else - inactive_ratio = 1; - - return inactive * inactive_ratio < active; -} - void mem_cgroup_handle_over_high(void); void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, @@ -646,12 +627,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return true; } -static inline bool -mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) -{ - return true; -} - static inline unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5c469c1..edbdf56 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6671,49 +6671,6 @@ void setup_per_zone_wmarks(void) } /* - * The inactive anon list should be small enough that the VM never has to - * do too much work, but large enough that each inactive page has a chance - * to be referenced again before it is swapped out. - * - * The inactive_anon ratio is the target ratio of ACTIVE_ANON to - * INACTIVE_ANON pages on this zone's LRU, maintained by the - * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of - * the anonymous pages are kept on the inactive list. - * - * total target max - * memory ratio inactive anon - * ------------------------------------- - * 10MB 1 5MB - * 100MB 1 50MB - * 1GB 3 250MB - * 10GB 10 0.9GB - * 100GB 31 3GB - * 1TB 101 10GB - * 10TB 320 32GB - */ -static void __meminit calculate_zone_inactive_ratio(struct zone *zone) -{ - unsigned int gb, ratio; - - /* Zone size in gigabytes */ - gb = zone->managed_pages >> (30 - PAGE_SHIFT); - if (gb) - ratio = int_sqrt(10 * gb); - else - ratio = 1; - - zone->inactive_ratio = ratio; -} - -static void __meminit setup_per_zone_inactive_ratio(void) -{ - struct zone *zone; - - for_each_zone(zone) - calculate_zone_inactive_ratio(zone); -} - -/* * Initialise min_free_kbytes. * * For small machines we want it small (128k min). 
For large machines @@ -6758,7 +6715,6 @@ int __meminit init_per_zone_wmark_min(void) setup_per_zone_wmarks(); refresh_zone_stat_thresholds(); setup_per_zone_lowmem_reserve(); - setup_per_zone_inactive_ratio(); return 0; } core_initcall(init_per_zone_wmark_min) diff --git a/mm/vmscan.c b/mm/vmscan.c index dcfdfc1..38d6d06 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1862,83 +1862,63 @@ static void shrink_active_list(unsigned long nr_to_scan, free_hot_cold_page_list(&l_hold, true); } -#ifdef CONFIG_SWAP -static bool inactive_anon_is_low_global(struct zone *zone) -{ - unsigned long active, inactive; - - active = zone_page_state(zone, NR_ACTIVE_ANON); - inactive = zone_page_state(zone, NR_INACTIVE_ANON); - - return inactive * zone->inactive_ratio < active; -} - -/** - * inactive_anon_is_low - check if anonymous pages need to be deactivated - * @lruvec: LRU vector to check +/* + * The inactive anon list should be small enough that the VM never has + * to do too much work. * - * Returns true if the zone does not have enough inactive anon pages, - * meaning some active anon pages need to be deactivated. - */ -static bool inactive_anon_is_low(struct lruvec *lruvec) -{ - /* - * If we don't have swap space, anonymous page deactivation - * is pointless. - */ - if (!total_swap_pages) - return false; - - if (!mem_cgroup_disabled()) - return mem_cgroup_inactive_anon_is_low(lruvec); - - return inactive_anon_is_low_global(lruvec_zone(lruvec)); -} -#else -static inline bool inactive_anon_is_low(struct lruvec *lruvec) -{ - return false; -} -#endif - -/** - * inactive_file_is_low - check if file pages need to be deactivated - * @lruvec: LRU vector to check + * The inactive file list should be small enough to leave most memory + * to the established workingset on the scan-resistant active list, + * but large enough to avoid thrashing the aggregate readahead window. * - * When the system is doing streaming IO, memory pressure here - * ensures that active file pages get deactivated, until more - * than half of the file pages are on the inactive list. + * Both inactive lists should also be large enough that each inactive + * page has a chance to be referenced again before it is reclaimed. * - * Once we get to that situation, protect the system's working - * set from being evicted by disabling active file page aging. + * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages + * on this LRU, maintained by the pageout code. A zone->inactive_ratio + * of 3 means 3:1 or 25% of the pages are kept on the inactive list. * - * This uses a different ratio than the anonymous pages, because - * the page cache uses a use-once replacement algorithm. + * total target max + * memory ratio inactive + * ------------------------------------- + * 10MB 1 5MB + * 100MB 1 50MB + * 1GB 3 250MB + * 10GB 10 0.9GB + * 100GB 31 3GB + * 1TB 101 10GB + * 10TB 320 32GB */ -static bool inactive_file_is_low(struct lruvec *lruvec) +static bool inactive_list_is_low(struct lruvec *lruvec, bool file) { + unsigned long inactive_ratio; unsigned long inactive; unsigned long active; + unsigned long gb; - inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE); - active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE); + /* + * If we don't have swap space, anonymous page deactivation + * is pointless. 
+ */ + if (!file && !total_swap_pages) + return false; - return active > inactive; -} + inactive = lruvec_lru_size(lruvec, file * LRU_FILE); + active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE); -static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru) -{ - if (is_file_lru(lru)) - return inactive_file_is_low(lruvec); + gb = (inactive + active) >> (30 - PAGE_SHIFT); + if (gb) + inactive_ratio = int_sqrt(10 * gb); else - return inactive_anon_is_low(lruvec); + inactive_ratio = 1; + + return inactive * inactive_ratio < active; } static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc) { if (is_active_lru(lru)) { - if (inactive_list_is_low(lruvec, lru)) + if (inactive_list_is_low(lruvec, is_file_lru(lru))) shrink_active_list(nr_to_scan, lruvec, sc, lru); return 0; } @@ -2059,7 +2039,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, * lruvec even if it has plenty of old anonymous pages unless the * system is under heavy pressure. */ - if (!inactive_file_is_low(lruvec) && + if (!inactive_list_is_low(lruvec, true) && lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) { scan_balance = SCAN_FILE; goto out; @@ -2301,7 +2281,7 @@ static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg, * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. */ - if (inactive_anon_is_low(lruvec)) + if (inactive_list_is_low(lruvec, false)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); @@ -2962,7 +2942,7 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); - if (inactive_anon_is_low(lruvec)) + if (inactive_list_is_low(lruvec, false)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); -- cgit v0.10.2 From b6459cc154e804f0de0d61fa023c4946b742cc96 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:34 -0700 Subject: vmscan: consider classzone_idx in compaction_ready Motivation: As pointed out by Linus [2][3] relying on zone_reclaimable as a way to communicate the reclaim progress is rater dubious. I tend to agree, not only it is really obscure, it is not hard to imagine cases where a single page freed in the loop keeps all the reclaimers looping without getting any progress because their gfp_mask wouldn't allow to get that page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather rare so it doesn't happen in the practice but the current logic which we have is rather obscure and hard to follow a also non-deterministic. This is an attempt to make the OOM detection more deterministic and easier to follow because each reclaimer basically tracks its own progress which is implemented at the page allocator layer rather spread out between the allocator and the reclaim. The more on the implementation is described in the first patch. I have tested several different scenarios but it should be clear that testing OOM killer is quite hard to be representative. There is usually a tiny gap between almost OOM and full blown OOM which is often time sensitive. Anyway, I have tested the following 2 scenarios and I would appreciate if there are more to test. Testing environment: a virtual machine with 2G of RAM and 2CPUs without any swap to make the OOM more deterministic. 
1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G file size, removes the files and starts over again) running in parallel for 10s to build up a lot of dirty pages when 100 parallel mem_eaters (anon private populated mmap which waits until it gets signal) with 80M each. This causes an OOM flood of course and I have compared both patched and unpatched kernels. The test is considered finished after there are no OOM conditions detected. This should tell us whether there are any excessive kills or some of them premature (e.g. due to dirty pages): I have performed two runs this time each after a fresh boot. * base kernel $ grep "Out of memory:" base-oom-run1.log | wc -l 78 $ grep "Out of memory:" base-oom-run2.log | wc -l 78 $ grep "Kill process" base-oom-run1.log | tail -n1 [ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child $ grep "Kill process" base-oom-run2.log | tail -n1 [ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61 $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52 $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l 1 $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l 3 * patched kernel $ grep "Out of memory:" patched-oom-run1.log | wc -l 78 miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l 77 e grep "Kill process" patched-oom-run1.log | tail -n1 [ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child $ grep "Kill process" patched-oom-run2.log | tail -n1 [ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78 $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77 e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l 2 $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l 3 The patched kernel run noticeably longer while invoking OOM killer same number of times. This means that the original implementation is much more aggressive and triggers the OOM killer sooner. free pages stats show that neither kernels went OOM too early most of the time, though. I guess the difference is in the backoff when retries without any progress do sleep for a while if there is memory under writeback or dirty which is highly likely considering the parallel IO. Both kernels have seen races where zone wasn't marked unreclaimable and we still hit the OOM killer. This is most likely a race where a task managed to exit between the last allocation attempt and the oom killer invocation. 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much memory as possible without triggering the OOM killer. This required a lot of tuning but I've considered 3 consecutive runs in three different boots without OOM as a success. * base kernel size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo) * patched kernel size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo) That means 40M more memory was usable without triggering OOM killer. 
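As a worked restatement of the sizing above (assuming the awk lines are the only difference between the runs): each of the 10 mem_eaters is sized to MemFree/10 minus a fixed headroom, 16M on the base kernel versus 12M on the patched one, so each eater is 4M larger on the patched kernel and 10 x 4M = 40M more memory is consumed in total before the OOM killer would fire.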
The base kernel sometimes managed to handle the same as patched but it wasn't consistent and failed in at least on of the 3 runs. This seems like a minor improvement. I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented memory and under memory pressure. The results are in patch 11 where the logic is implemented. In short I can see huge improvement there. I am certainly interested in other usecases as well as well as any feedback. Especially those which require higher order requests. This patch (of 14): While playing with the oom detection rework [1] I have noticed that my heavy order-9 (hugetlb) load close to OOM ended up in an endless loop where the reclaim hasn't made any progress but did_some_progress didn't reflect that and compaction_suitable was backing off because no zone is above low wmark + 1 << order. It turned out that this is in fact an old standing bug in compaction_ready which ignores the requested_highidx and did the watermark check for 0 classzone_idx. This succeeds for zone DMA most of the time as the zone is mostly unused because of lowmem protection. As a result costly high order allocatios always report a successfull progress even when there was none. This wasn't a problem so far because these allocations usually fail quite early or retry only few times with __GFP_REPEAT but this will change after later patch in this series so make sure to not lie about the progress and propagate requested_highidx down to compaction_ready and use it for both the watermak check and compaction_suitable to fix this issue. [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org [2] https://lkml.org/lkml/2015/10/12/808 [3] https://lkml.org/lkml/2015/10/13/597 Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: Johannes Weiner Cc: Mel Gorman Cc: David Rientjes Cc: Tetsuo Handa Cc: Joonsoo Kim Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmscan.c b/mm/vmscan.c index 38d6d06..a386454 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2459,7 +2459,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, * Returns true if compaction should go ahead for a high-order request, or * the high-order allocation would succeed without compaction. */ -static inline bool compaction_ready(struct zone *zone, int order) +static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx) { unsigned long balance_gap, watermark; bool watermark_ok; @@ -2473,7 +2473,7 @@ static inline bool compaction_ready(struct zone *zone, int order) balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP( zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO)); watermark = high_wmark_pages(zone) + balance_gap + (2UL << order); - watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0); + watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx); /* * If compaction is deferred, reclaim up to a point where @@ -2486,7 +2486,7 @@ static inline bool compaction_ready(struct zone *zone, int order) * If compaction is not ready to start and allocation is not likely * to succeed without it, then keep reclaiming. 
*/ - if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED) + if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED) return false; return watermark_ok; @@ -2566,7 +2566,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) if (IS_ENABLED(CONFIG_COMPACTION) && sc->order > PAGE_ALLOC_COSTLY_ORDER && zonelist_zone_idx(z) <= requested_highidx && - compaction_ready(zone, sc->order)) { + compaction_ready(zone, sc->order, requested_highidx)) { sc->compaction_ready = true; continue; } -- cgit v0.10.2 From ea7ab982b6bdb7ce218fd3a7850bb2e2b414fdd0 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:38 -0700 Subject: mm, compaction: change COMPACT_ constants into enum Compaction code is doing weird dances between COMPACT_FOO -> int -> unsigned long But there doesn't seem to be any reason for that. All functions which return/use one of those constants are not expecting any other value so it really makes sense to define an enum for them and make it clear that no other values are expected. This is a pure cleanup and shouldn't introduce any functional changes. Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 242b660..706cbf0 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -2,21 +2,29 @@ #define _LINUX_COMPACTION_H /* Return values for compact_zone() and try_to_compact_pages() */ -/* compaction didn't start as it was deferred due to past failures */ -#define COMPACT_DEFERRED 0 -/* compaction didn't start as it was not possible or direct reclaim was more suitable */ -#define COMPACT_SKIPPED 1 -/* compaction should continue to another pageblock */ -#define COMPACT_CONTINUE 2 -/* direct compaction partially compacted a zone and there are suitable pages */ -#define COMPACT_PARTIAL 3 -/* The full zone was compacted */ -#define COMPACT_COMPLETE 4 -/* For more detailed tracepoint output */ -#define COMPACT_NO_SUITABLE_PAGE 5 -#define COMPACT_NOT_SUITABLE_ZONE 6 -#define COMPACT_CONTENDED 7 /* When adding new states, please adjust include/trace/events/compaction.h */ +enum compact_result { + /* compaction didn't start as it was deferred due to past failures */ + COMPACT_DEFERRED, + /* + * compaction didn't start as it was not possible or direct reclaim + * was more suitable + */ + COMPACT_SKIPPED, + /* compaction should continue to another pageblock */ + COMPACT_CONTINUE, + /* + * direct compaction partially compacted a zone and there are suitable + * pages + */ + COMPACT_PARTIAL, + /* The full zone was compacted */ + COMPACT_COMPLETE, + /* For more detailed tracepoint output */ + COMPACT_NO_SUITABLE_PAGE, + COMPACT_NOT_SUITABLE_ZONE, + COMPACT_CONTENDED, +}; /* Used to signal whether compaction detected need_sched() or lock contention */ /* No contention detected */ @@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write, extern int sysctl_compact_unevictable_allowed; extern int fragmentation_index(struct zone *zone, unsigned int order); -extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order, +extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, + unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, enum migrate_mode mode, int *contended); extern void 
compact_pgdat(pg_data_t *pgdat, int order); extern void reset_isolation_suitable(pg_data_t *pgdat); -extern unsigned long compaction_suitable(struct zone *zone, int order, +extern enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int classzone_idx); extern void defer_compaction(struct zone *zone, int order); @@ -57,7 +66,7 @@ extern void kcompactd_stop(int nid); extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx); #else -static inline unsigned long try_to_compact_pages(gfp_t gfp_mask, +static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, int alloc_flags, const struct alloc_context *ac, enum migrate_mode mode, int *contended) @@ -73,7 +82,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat) { } -static inline unsigned long compaction_suitable(struct zone *zone, int order, +static inline enum compact_result compaction_suitable(struct zone *zone, int order, int alloc_flags, int classzone_idx) { return COMPACT_SKIPPED; diff --git a/mm/compaction.c b/mm/compaction.c index eda3c22..e721d25 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1229,7 +1229,7 @@ static inline bool is_via_compact_memory(int order) return order == -1; } -static int __compact_finished(struct zone *zone, struct compact_control *cc, +static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc, const int migratetype) { unsigned int order; @@ -1292,8 +1292,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc, return COMPACT_NO_SUITABLE_PAGE; } -static int compact_finished(struct zone *zone, struct compact_control *cc, - const int migratetype) +static enum compact_result compact_finished(struct zone *zone, + struct compact_control *cc, + const int migratetype) { int ret; @@ -1312,7 +1313,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc, * COMPACT_PARTIAL - If the allocation would succeed without compaction * COMPACT_CONTINUE - If compaction should run now */ -static unsigned long __compaction_suitable(struct zone *zone, int order, +static enum compact_result __compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int classzone_idx) { @@ -1358,11 +1359,11 @@ static unsigned long __compaction_suitable(struct zone *zone, int order, return COMPACT_CONTINUE; } -unsigned long compaction_suitable(struct zone *zone, int order, +enum compact_result compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, int classzone_idx) { - unsigned long ret; + enum compact_result ret; ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx); trace_mm_compaction_suitable(zone, order, ret); @@ -1372,9 +1373,9 @@ unsigned long compaction_suitable(struct zone *zone, int order, return ret; } -static int compact_zone(struct zone *zone, struct compact_control *cc) +static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc) { - int ret; + enum compact_result ret; unsigned long start_pfn = zone->zone_start_pfn; unsigned long end_pfn = zone_end_pfn(zone); const int migratetype = gfpflags_to_migratetype(cc->gfp_mask); @@ -1530,11 +1531,11 @@ out: return ret; } -static unsigned long compact_zone_order(struct zone *zone, int order, +static enum compact_result compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, enum migrate_mode mode, int *contended, unsigned int alloc_flags, int classzone_idx) { - unsigned long ret; + enum compact_result ret; struct 
compact_control cc = { .nr_freepages = 0, .nr_migratepages = 0, @@ -1572,7 +1573,7 @@ int sysctl_extfrag_threshold = 500; * * This is the main entry point for direct page compaction. */ -unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order, +enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, enum migrate_mode mode, int *contended) { @@ -1580,7 +1581,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order, int may_perform_io = gfp_mask & __GFP_IO; struct zoneref *z; struct zone *zone; - int rc = COMPACT_DEFERRED; + enum compact_result rc = COMPACT_DEFERRED; int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ *contended = COMPACT_CONTENDED_NONE; @@ -1594,7 +1595,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order, /* Compact each zone in the list */ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) { - int status; + enum compact_result status; int zone_contended; if (compaction_deferred(zone, order)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index edbdf56..ed62c4b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3188,7 +3188,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, enum migrate_mode mode, int *contended_compaction, bool *deferred_compaction) { - unsigned long compact_result; + enum compact_result compact_result; struct page *page; if (!order) -- cgit v0.10.2 From c46649deae3f00aa8ba8716f0ddb8eef2dc9532f Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:41 -0700 Subject: mm, compaction: cover all compaction mode in compact_zone The compiler is complaining after "mm, compaction: change COMPACT_ constants into enum" mm/compaction.c: In function `compact_zone': mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch] switch (ret) { ^ mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch] compaction_suitable is allowed to return only COMPACT_PARTIAL, COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply impossible. Put a VM_BUG_ON to catch an impossible return value. 
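For illustration, the warning and the shape of the fix can be reproduced outside the kernel; the sketch below (illustrative names, userspace assert() standing in for VM_BUG_ON) mirrors the fixed version: explicit comparisons plus an assertion replace a switch that lists only some enumerators and therefore warns under -Wswitch:

/* build with: cc -Wall -Wswitch compact-enum-demo.c */
#include <assert.h>

enum demo_result { DEMO_SKIPPED, DEMO_CONTINUE, DEMO_PARTIAL, DEMO_COMPLETE };

static enum demo_result handle(enum demo_result ret)
{
        /*
         * A switch over ret listing only SKIPPED/PARTIAL/CONTINUE would
         * warn under -Wswitch; explicit checks avoid that while keeping
         * the "impossible value" guard.
         */
        if (ret == DEMO_PARTIAL || ret == DEMO_SKIPPED)
                return ret;     /* likely to fail, bail out */

        /* anything other than CONTINUE would be a bug in the caller */
        assert(ret == DEMO_CONTINUE);
        return ret;
}

int main(void)
{
        return handle(DEMO_CONTINUE) == DEMO_CONTINUE ? 0 : 1;
}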
Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/compaction.c b/mm/compaction.c index e721d25..455ecd8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1383,15 +1383,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro ret = compaction_suitable(zone, cc->order, cc->alloc_flags, cc->classzone_idx); - switch (ret) { - case COMPACT_PARTIAL: - case COMPACT_SKIPPED: - /* Compaction is likely to fail */ + /* Compaction is likely to fail */ + if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED) return ret; - case COMPACT_CONTINUE: - /* Fall through to compaction */ - ; - } + + /* huh, compaction_suitable is returning something unexpected */ + VM_BUG_ON(ret != COMPACT_CONTINUE); /* * Clear pageblock skip if there were failures recently and compaction -- cgit v0.10.2 From 1d4746d395975e0ff5103e20ab169d1a95b4ef9e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:44 -0700 Subject: mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED try_to_compact_pages() can currently return COMPACT_SKIPPED even when the compaction is defered for some zone just because zone DMA is skipped in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED basically unusable for the page allocator as a feedback mechanism. Make sure we distinguish those two states properly and switch their ordering in the enum. This would mean that the COMPACT_SKIPPED will be returned only when all eligible zones are skipped. As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath will be more precise and we would bail out rather than reclaim. 
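The ordering trick can be shown with a small standalone sketch (assumed names; at this point in the series the kernel only applies max_t() to the deferred case in try_to_compact_pages()): because COMPACT_SKIPPED now sorts below COMPACT_DEFERRED, taking the maximum across zones reports "deferred" only when no zone produced anything better:

#include <stdio.h>

enum demo_result { DEMO_SKIPPED, DEMO_DEFERRED, DEMO_CONTINUE, DEMO_PARTIAL };

#define max_result(a, b)        ((a) > (b) ? (a) : (b))

int main(void)
{
        /* hypothetical per-zone outcomes: DMA skipped on watermarks, others deferred */
        enum demo_result zones[] = { DEMO_SKIPPED, DEMO_DEFERRED, DEMO_DEFERRED };
        enum demo_result rc = DEMO_SKIPPED;     /* lowest-priority starting value */
        unsigned int i;

        for (i = 0; i < sizeof(zones) / sizeof(zones[0]); i++)
                rc = max_result(rc, zones[i]);

        /* prints 1 (DEMO_DEFERRED): the allocator sees "deferred", not "skipped" */
        printf("aggregated rc = %d\n", rc);
        return 0;
}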
Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 706cbf0..11f2287 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -4,13 +4,16 @@ /* Return values for compact_zone() and try_to_compact_pages() */ /* When adding new states, please adjust include/trace/events/compaction.h */ enum compact_result { - /* compaction didn't start as it was deferred due to past failures */ - COMPACT_DEFERRED, /* * compaction didn't start as it was not possible or direct reclaim * was more suitable */ COMPACT_SKIPPED, + /* compaction didn't start as it was deferred due to past failures */ + COMPACT_DEFERRED, + /* compaction not active last round */ + COMPACT_INACTIVE = COMPACT_DEFERRED, + /* compaction should continue to another pageblock */ COMPACT_CONTINUE, /* diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h index e215bf6..6ba16c8 100644 --- a/include/trace/events/compaction.h +++ b/include/trace/events/compaction.h @@ -10,8 +10,8 @@ #include #define COMPACTION_STATUS \ - EM( COMPACT_DEFERRED, "deferred") \ EM( COMPACT_SKIPPED, "skipped") \ + EM( COMPACT_DEFERRED, "deferred") \ EM( COMPACT_CONTINUE, "continue") \ EM( COMPACT_PARTIAL, "partial") \ EM( COMPACT_COMPLETE, "complete") \ diff --git a/mm/compaction.c b/mm/compaction.c index 455ecd8..b2b9447 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1578,7 +1578,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, int may_perform_io = gfp_mask & __GFP_IO; struct zoneref *z; struct zone *zone; - enum compact_result rc = COMPACT_DEFERRED; + enum compact_result rc = COMPACT_SKIPPED; int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ *contended = COMPACT_CONTENDED_NONE; @@ -1595,8 +1595,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, enum compact_result status; int zone_contended; - if (compaction_deferred(zone, order)) + if (compaction_deferred(zone, order)) { + rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); continue; + } status = compact_zone_order(zone, order, gfp_mask, mode, &zone_contended, alloc_flags, @@ -1667,7 +1669,7 @@ break_loop: * If at least one zone wasn't deferred or skipped, we report if all * zones that were tried were lock contended. */ - if (rc > COMPACT_SKIPPED && all_zones_contended) + if (rc > COMPACT_INACTIVE && all_zones_contended) *contended = COMPACT_CONTENDED_LOCK; return rc; -- cgit v0.10.2 From c8f7de0bfae36e8532e5e25a39d15407f02aca78 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:47 -0700 Subject: mm, compaction: distinguish between full and partial COMPACT_COMPLETE COMPACT_COMPLETE now means that compaction and free scanner met. This is not very useful information if somebody just wants to use this feedback and make any decisions based on that. The current caller might be a poor guy who just happened to scan tiny portion of the zone and that could be the reason no suitable pages were compacted. Make sure we distinguish the full and partial zone walks. Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success and be optimistic in retrying. 
The existing users of COMPACT_COMPLETE are conservatively changed to use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably reconsidered and only defer the compaction only for COMPACT_COMPLETE with the new semantic. This patch shouldn't introduce any functional changes. Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 11f2287..9b37f9d 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -21,7 +21,15 @@ enum compact_result { * pages */ COMPACT_PARTIAL, - /* The full zone was compacted */ + /* + * direct compaction has scanned part of the zone but wasn't successfull + * to compact suitable pages. + */ + COMPACT_PARTIAL_SKIPPED, + /* + * The full zone was compacted scanned but wasn't successfull to compact + * suitable pages. + */ COMPACT_COMPLETE, /* For more detailed tracepoint output */ COMPACT_NO_SUITABLE_PAGE, diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h index 6ba16c8..36e2d6f 100644 --- a/include/trace/events/compaction.h +++ b/include/trace/events/compaction.h @@ -14,6 +14,7 @@ EM( COMPACT_DEFERRED, "deferred") \ EM( COMPACT_CONTINUE, "continue") \ EM( COMPACT_PARTIAL, "partial") \ + EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \ EM( COMPACT_COMPLETE, "complete") \ EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \ EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \ diff --git a/mm/compaction.c b/mm/compaction.c index b2b9447..4af1577 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1252,7 +1252,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_ if (cc->direct_compaction) zone->compact_blockskip_flush = true; - return COMPACT_COMPLETE; + if (cc->whole_zone) + return COMPACT_COMPLETE; + else + return COMPACT_PARTIAL_SKIPPED; } if (is_via_compact_memory(cc->order)) @@ -1413,6 +1416,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn; zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn; } + + if (cc->migrate_pfn == start_pfn) + cc->whole_zone = true; + cc->last_migrated_pfn = 0; trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, @@ -1634,7 +1641,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, goto break_loop; } - if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) { + if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE || + status == COMPACT_PARTIAL_SKIPPED)) { /* * We think that allocation won't succeed in this zone * so we defer compaction there. If it ends up @@ -1881,7 +1889,7 @@ static void kcompactd_do_work(pg_data_t *pgdat) cc.classzone_idx, 0)) { success = true; compaction_defer_reset(zone, cc.order, false); - } else if (status == COMPACT_COMPLETE) { + } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) { /* * We use sync migration mode here, so we defer like * sync direct compaction does. diff --git a/mm/internal.h b/mm/internal.h index 3ac544f..f6f3353 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -174,6 +174,7 @@ struct compact_control { enum migrate_mode mode; /* Async or sync migration mode */ bool ignore_skip_hint; /* Scan blocks even if marked skip */ bool direct_compaction; /* False from kcompactd or /proc/... 
*/ + bool whole_zone; /* Whole zone has been scanned */ int order; /* order a direct compactor needs */ const gfp_t gfp_mask; /* gfp mask of a direct compactor */ const unsigned int alloc_flags; /* alloc flags of a direct compactor */ -- cgit v0.10.2 From 4f9a358c36fcdad3ea1db263ec4d484a70ad543e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:50 -0700 Subject: mm, compaction: update compaction_result ordering compaction_result will be used as the primary feedback channel for compaction users. At the same time try_to_compact_pages (and potentially others) assume a certain ordering where a more specific feedback takes precendence. This gets a bit awkward when we have conflicting feedback from different zones. E.g one returing COMPACT_COMPLETE meaning the full zone has been scanned without any outcome while other returns with COMPACT_PARTIAL aka made some progress. The caller should get COMPACT_PARTIAL because that means that the compaction still can make some progress. The same applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the larger the value is the more progress we have done. Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 9b37f9d..ff39fa0 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -4,6 +4,8 @@ /* Return values for compact_zone() and try_to_compact_pages() */ /* When adding new states, please adjust include/trace/events/compaction.h */ enum compact_result { + /* For more detailed tracepoint output - internal to compaction */ + COMPACT_NOT_SUITABLE_ZONE, /* * compaction didn't start as it was not possible or direct reclaim * was more suitable @@ -11,30 +13,34 @@ enum compact_result { COMPACT_SKIPPED, /* compaction didn't start as it was deferred due to past failures */ COMPACT_DEFERRED, + /* compaction not active last round */ COMPACT_INACTIVE = COMPACT_DEFERRED, + /* For more detailed tracepoint output - internal to compaction */ + COMPACT_NO_SUITABLE_PAGE, /* compaction should continue to another pageblock */ COMPACT_CONTINUE, + /* - * direct compaction partially compacted a zone and there are suitable - * pages + * The full zone was compacted scanned but wasn't successfull to compact + * suitable pages. */ - COMPACT_PARTIAL, + COMPACT_COMPLETE, /* * direct compaction has scanned part of the zone but wasn't successfull * to compact suitable pages. */ COMPACT_PARTIAL_SKIPPED, + + /* compaction terminated prematurely due to lock contentions */ + COMPACT_CONTENDED, + /* - * The full zone was compacted scanned but wasn't successfull to compact - * suitable pages. 
+ * direct compaction partially compacted a zone and there might be + * suitable pages */ - COMPACT_COMPLETE, - /* For more detailed tracepoint output */ - COMPACT_NO_SUITABLE_PAGE, - COMPACT_NOT_SUITABLE_ZONE, - COMPACT_CONTENDED, + COMPACT_PARTIAL, }; /* Used to signal whether compaction detected need_sched() or lock contention */ -- cgit v0.10.2 From c5d01d0d18e2ab7a21f0371b00e4d1a06f79cdf5 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:53 -0700 Subject: mm, compaction: simplify __alloc_pages_direct_compact feedback interface __alloc_pages_direct_compact communicates potential back off by two variables: - deferred_compaction tells that the compaction returned COMPACT_DEFERRED - contended_compaction is set when there is a contention on zone->lock resp. zone->lru_lock locks __alloc_pages_slowpath then backs of for THP allocation requests to prevent from long stalls. This is rather messy and it would be much cleaner to return a single compact result value and hide all the nasty details into __alloc_pages_direct_compact. This patch shouldn't introduce any functional changes. Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ed62c4b..8bcc106 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3185,29 +3185,21 @@ out: static struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, - enum migrate_mode mode, int *contended_compaction, - bool *deferred_compaction) + enum migrate_mode mode, enum compact_result *compact_result) { - enum compact_result compact_result; struct page *page; + int contended_compaction; if (!order) return NULL; current->flags |= PF_MEMALLOC; - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, - mode, contended_compaction); + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, + mode, &contended_compaction); current->flags &= ~PF_MEMALLOC; - switch (compact_result) { - case COMPACT_DEFERRED: - *deferred_compaction = true; - /* fall-through */ - case COMPACT_SKIPPED: + if (*compact_result <= COMPACT_INACTIVE) return NULL; - default: - break; - } /* * At least in one zone compaction wasn't deferred or skipped, so let's @@ -3233,6 +3225,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, */ count_vm_event(COMPACTFAIL); + /* + * In all zones where compaction was attempted (and not + * deferred or skipped), lock contention has been detected. + * For THP allocation we do not want to disrupt the others + * so we fallback to base pages instead. + */ + if (contended_compaction == COMPACT_CONTENDED_LOCK) + *compact_result = COMPACT_CONTENDED; + + /* + * If compaction was aborted due to need_resched(), we do not + * want to further increase allocation latency, unless it is + * khugepaged trying to collapse. 
+ */ + if (contended_compaction == COMPACT_CONTENDED_SCHED + && !(current->flags & PF_KTHREAD)) + *compact_result = COMPACT_CONTENDED; + cond_resched(); return NULL; @@ -3241,8 +3251,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, static inline struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, - enum migrate_mode mode, int *contended_compaction, - bool *deferred_compaction) + enum migrate_mode mode, enum compact_result *compact_result) { return NULL; } @@ -3387,8 +3396,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned long pages_reclaimed = 0; unsigned long did_some_progress; enum migrate_mode migration_mode = MIGRATE_ASYNC; - bool deferred_compaction = false; - int contended_compaction = COMPACT_CONTENDED_NONE; + enum compact_result compact_result; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -3475,8 +3483,7 @@ retry: */ page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, migration_mode, - &contended_compaction, - &deferred_compaction); + &compact_result); if (page) goto got_pg; @@ -3489,25 +3496,14 @@ retry: * to heavily disrupt the system, so we fail the allocation * instead of entering direct reclaim. */ - if (deferred_compaction) - goto nopage; - - /* - * In all zones where compaction was attempted (and not - * deferred or skipped), lock contention has been detected. - * For THP allocation we do not want to disrupt the others - * so we fallback to base pages instead. - */ - if (contended_compaction == COMPACT_CONTENDED_LOCK) + if (compact_result == COMPACT_DEFERRED) goto nopage; /* - * If compaction was aborted due to need_resched(), we do not - * want to further increase allocation latency, unless it is - * khugepaged trying to collapse. + * Compaction is contended so rather back off than cause + * excessive stalls. */ - if (contended_compaction == COMPACT_CONTENDED_SCHED - && !(current->flags & PF_KTHREAD)) + if(compact_result == COMPACT_CONTENDED) goto nopage; } @@ -3555,8 +3551,7 @@ noretry: */ page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, migration_mode, - &contended_compaction, - &deferred_compaction); + &compact_result); if (page) goto got_pg; nopage: -- cgit v0.10.2 From cab1802b5f0dddea30547a7451fda8c7e4c593f0 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:56:56 -0700 Subject: mm, compaction: abstract compaction feedback to helpers Compaction can provide a wild variation of feedback to the caller. Many of them are implementation specific and the caller of the compaction (especially the page allocator) shouldn't be bound to specifics of the current implementation. This patch abstracts the feedback into three basic types: - compaction_made_progress - compaction was active and made some progress. - compaction_failed - compaction failed and further attempts to invoke it would most probably fail and therefore it is not worth retrying - compaction_withdrawn - compaction wasn't invoked for an implementation specific reasons. In the current implementation it means that the compaction was deferred, contended or the page scanners met too early without any progress. Retrying is still worthwhile. 
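For illustration, a hedged sketch of how a caller might consume these helpers (the wrapper name and the retry policy are assumptions; the actual allocator-side consumer only lands in later patches of the series):

static bool should_compact_retry_demo(enum compact_result result,
                                      int retries, int max_retries)
{
        /* compaction isolated and migrated pageblocks: retry while it keeps helping */
        if (compaction_made_progress(result))
                return retries < max_retries;

        /* whole zones were scanned without result: further attempts are pointless */
        if (compaction_failed(result))
                return false;

        /* deferred, contended or partial scan: a back off happened, retrying is fine */
        if (compaction_withdrawn(result))
                return true;

        return false;
}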
[vbabka@suse.cz: do not change thp back off behavior] [akpm@linux-foundation.org: fix typo in comment, per Hillf] Signed-off-by: Michal Hocko Acked-by: Hillf Danton Acked-by: Vlastimil Babka Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index ff39fa0..8d8c916 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order, bool alloc_success); extern bool compaction_restarting(struct zone *zone, int order); +/* Compaction has made some progress and retrying makes sense */ +static inline bool compaction_made_progress(enum compact_result result) +{ + /* + * Even though this might sound confusing this in fact tells us + * that the compaction successfully isolated and migrated some + * pageblocks. + */ + if (result == COMPACT_PARTIAL) + return true; + + return false; +} + +/* Compaction has failed and it doesn't make much sense to keep retrying. */ +static inline bool compaction_failed(enum compact_result result) +{ + /* All zones were scanned completely and still not result. */ + if (result == COMPACT_COMPLETE) + return true; + + return false; +} + +/* + * Compaction has backed off for some reason. It might be throttling or + * lock contention. Retrying is still worthwhile. + */ +static inline bool compaction_withdrawn(enum compact_result result) +{ + /* + * Compaction backed off due to watermark checks for order-0 + * so the regular reclaim has to try harder and reclaim something. + */ + if (result == COMPACT_SKIPPED) + return true; + + /* + * If compaction is deferred for high-order allocations, it is + * because sync compaction recently failed. If this is the case + * and the caller requested a THP allocation, we do not want + * to heavily disrupt the system, so we fail the allocation + * instead of entering direct reclaim. + */ + if (result == COMPACT_DEFERRED) + return true; + + /* + * If compaction in async mode encounters contention or blocks higher + * priority task we back off early rather than cause stalls. + */ + if (result == COMPACT_CONTENDED) + return true; + + /* + * Page scanners have met but we haven't scanned full zones so this + * is a back off in fact. + */ + if (result == COMPACT_PARTIAL_SKIPPED) + return true; + + return false; +} + extern int kcompactd_run(int nid); extern void kcompactd_stop(int nid); extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx); @@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order) return true; } +static inline bool compaction_made_progress(enum compact_result result) +{ + return false; +} + +static inline bool compaction_failed(enum compact_result result) +{ + return false; +} + +static inline bool compaction_withdrawn(enum compact_result result) +{ + return true; +} + static inline int kcompactd_run(int nid) { return 0; -- cgit v0.10.2 From 0a0337e0d1d134465778a16f5cbea95086e8e9e0 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:00 -0700 Subject: mm, oom: rework oom detection __alloc_pages_slowpath has traditionally relied on the direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM. 
shrink_zones had to rely on zone_reclaimable if shrink_zone didn't make any progress to prevent from a premature OOM killer invocation - the LRU might be full of dirty or writeback pages and direct reclaim cannot clean those up. zone_reclaimable allows to rescan the reclaimable lists several times and restart if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps allocator looping but the current task will not be able to allocate that single page. OOM killer would be more appropriate than looping without any progress for unbounded amount of time. This patch changes OOM detection logic and pulls it out from shrink_zone which is too low to be appropriate for any high level decisions such as OOM which is per zonelist property. It is __alloc_pages_slowpath which knows how many attempts have been done and what was the progress so far therefore it is more appropriate to implement this logic. The new heuristic is implemented in should_reclaim_retry helper called from __alloc_pages_slowpath. It tries to be more deterministic and easier to follow. It builds on an assumption that retrying makes sense only if the currently reclaimable memory + free pages would allow the current allocation request to succeed (as per __zone_watermark_ok) at least for one zone in the usable zonelist. This alone wouldn't be sufficient, though, because the writeback might get stuck and reclaimable pages might be pinned for a really long time or even depend on the current allocation context. Therefore there is a backoff mechanism implemented which reduces the reclaim target after each reclaim round without any progress. This means that we should eventually converge to only NR_FREE_PAGES as the target and fail on the wmark check and proceed to OOM. The backoff is simple and linear with 1/16 of the reclaimable pages for each round without any progress. We are optimistic and reset counter for successful reclaim rounds. Costly high order pages mostly preserve their semantic and those without __GFP_REPEAT fail right away while those which have the flag set will back off after the amount of reclaimable pages reaches equivalent of the requested order. The only difference is that if there was no progress during the reclaim we rely on zone watermark check. This is more logical thing to do than previous 1< Acked-by: Hillf Danton Cc: Vladimir Davydov Cc: Johannes Weiner Cc: David Rientjes Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/swap.h b/include/linux/swap.h index ad22035..0af2bb2 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ +extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8bcc106..fa39efc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3386,6 +3386,77 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask) return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE; } +/* + * Maximum number of reclaim retries without any progress before OOM killer + * is consider as the only way to move forward. 
+ */ +#define MAX_RECLAIM_RETRIES 16 + +/* + * Checks whether it makes sense to retry the reclaim to make a forward progress + * for the given allocation request. + * The reclaim feedback represented by did_some_progress (any progress during + * the last reclaim round), pages_reclaimed (cumulative number of reclaimed + * pages) and no_progress_loops (number of reclaim rounds without any progress + * in a row) is considered as well as the reclaimable pages on the applicable + * zone list (with a backoff mechanism which is a function of no_progress_loops). + * + * Returns true if a retry is viable or false to enter the oom path. + */ +static inline bool +should_reclaim_retry(gfp_t gfp_mask, unsigned order, + struct alloc_context *ac, int alloc_flags, + bool did_some_progress, unsigned long pages_reclaimed, + int no_progress_loops) +{ + struct zone *zone; + struct zoneref *z; + + /* + * Make sure we converge to OOM if we cannot make any progress + * several times in the row. + */ + if (no_progress_loops > MAX_RECLAIM_RETRIES) + return false; + + if (order > PAGE_ALLOC_COSTLY_ORDER) { + if (pages_reclaimed >= (1<zonelist, ac->high_zoneidx, + ac->nodemask) { + unsigned long available; + + available = zone_reclaimable_pages(zone); + available -= DIV_ROUND_UP(no_progress_loops * available, + MAX_RECLAIM_RETRIES); + available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + + /* + * Would the allocation succeed if we reclaimed the whole + * available? + */ + if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), + ac->high_zoneidx, alloc_flags, available)) { + /* Wait for some write requests to complete then retry */ + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50); + return true; + } + } + + return false; +} + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) @@ -3397,6 +3468,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned long did_some_progress; enum migrate_mode migration_mode = MIGRATE_ASYNC; enum compact_result compact_result; + int no_progress_loops = 0; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -3525,23 +3597,35 @@ retry: if (gfp_mask & __GFP_NORETRY) goto noretry; - /* Keep reclaiming pages as long as there is reasonable progress */ - pages_reclaimed += did_some_progress; - if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || - ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { - /* Wait for some write requests to complete then retry */ - wait_iff_congested(ac->preferred_zoneref->zone, BLK_RW_ASYNC, HZ/50); - goto retry; + /* + * Do not retry costly high order allocations unless they are + * __GFP_REPEAT + */ + if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT)) + goto noretry; + + if (did_some_progress) { + no_progress_loops = 0; + pages_reclaimed += did_some_progress; + } else { + no_progress_loops++; } + if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, + did_some_progress > 0, pages_reclaimed, + no_progress_loops)) + goto retry; + /* Reclaim has failed us, start killing things */ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) goto got_pg; /* Retry as long as the OOM killer is making progress */ - if (did_some_progress) + if (did_some_progress) { + no_progress_loops = 0; goto retry; + } noretry: /* diff --git a/mm/vmscan.c b/mm/vmscan.c index a386454..c4a2f45 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -191,7 +191,7 @@ static bool sane_reclaim(struct scan_control *sc) } #endif 
-static unsigned long zone_reclaimable_pages(struct zone *zone) +unsigned long zone_reclaimable_pages(struct zone *zone) { unsigned long nr; @@ -2507,10 +2507,8 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_ * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. - * - * Returns true if a zone was reclaimable. */ -static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) +static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) { struct zoneref *z; struct zone *zone; @@ -2518,7 +2516,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) unsigned long nr_soft_scanned; gfp_t orig_mask; enum zone_type requested_highidx = gfp_zone(sc->gfp_mask); - bool reclaimable = false; /* * If the number of buffer_heads in the machine exceeds the maximum @@ -2583,17 +2580,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) &nr_soft_scanned); sc->nr_reclaimed += nr_soft_reclaimed; sc->nr_scanned += nr_soft_scanned; - if (nr_soft_reclaimed) - reclaimable = true; /* need some check for avoid more shrink_zone() */ } - if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx)) - reclaimable = true; - - if (global_reclaim(sc) && - !reclaimable && zone_reclaimable(zone)) - reclaimable = true; + shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); } /* @@ -2601,8 +2591,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) * promoted it to __GFP_HIGHMEM. */ sc->gfp_mask = orig_mask; - - return reclaimable; } /* @@ -2627,7 +2615,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, int initial_priority = sc->priority; unsigned long total_scanned = 0; unsigned long writeback_threshold; - bool zones_reclaimable; retry: delayacct_freepages_start(); @@ -2638,7 +2625,7 @@ retry: vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, sc->priority); sc->nr_scanned = 0; - zones_reclaimable = shrink_zones(zonelist, sc); + shrink_zones(zonelist, sc); total_scanned += sc->nr_scanned; if (sc->nr_reclaimed >= sc->nr_to_reclaim) @@ -2685,10 +2672,6 @@ retry: goto retry; } - /* Any of the zones still reclaimable? Don't OOM. */ - if (zones_reclaimable) - return 1; - return 0; } -- cgit v0.10.2 From ede37713737834d98ec72ed299a305d53e909f73 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:03 -0700 Subject: mm: throttle on IO only when there are too many dirty and writeback pages wait_iff_congested has been used to throttle allocator before it retried another round of direct reclaim to allow the writeback to make some progress and prevent reclaim from looping over dirty/writeback pages without making any progress. We used to do congestion_wait before commit 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") but that led to undesirable stalls and sleeping for the full timeout even when the BDI wasn't congested. Hence wait_iff_congested was used instead. But it seems that even wait_iff_congested doesn't work as expected. We might have a small file LRU list with all pages dirty/writeback and yet the bdi is not congested so this is just a cond_resched in the end and can end up triggering pre mature OOM. 
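To make the failure mode concrete, the non-congested path of wait_iff_congested() effectively reduces to the following (a condensed sketch of the code removed by the mm/backing-dev.c hunk below, not the verbatim kernel source):

/* Sketch: when the zone is not marked congested, there is no real sleep. */
long wait_iff_congested_sketch(struct zone *zone, int sync, long timeout)
{
	unsigned long start = jiffies;
	long ret;

	if (!test_bit(ZONE_CONGESTED, &zone->flags)) {
		cond_resched();			/* yields the CPU, throttles nothing */
		ret = timeout - (jiffies - start);
		return ret < 0 ? 0 : ret;
	}

	return congestion_wait(sync, timeout);	/* the only path that really waits */
}

With a small file LRU consisting entirely of dirty or writeback pages, reclaim keeps hitting the first branch, so the allocator loops toward OOM without giving the IO any time to complete.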
This patch replaces the unconditional wait_iff_congested by congestion_wait which is executed only if we _know_ that the last round of direct reclaim didn't make any progress and dirty+writeback pages are more than a half of the reclaimable pages on the zone which might be usable for our target allocation. This shouldn't reintroduce stalls fixed by 0e093d99763e because congestion_wait is called only when we are getting hopeless when sleeping is a better choice than OOM with many pages under IO. We have to preserve logic introduced by commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") into the __alloc_pages_slowpath now that wait_iff_congested is not used anymore. As the only remaining user of wait_iff_congested is shrink_inactive_list we can remove the WQ specific short sleep from wait_iff_congested because the sleep is needed to be done only once in the allocation retry cycle. [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly] Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 0c6317b..ed173b8 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait); * jiffies for either a BDI to exit congestion of the given @sync queue * or a write to complete. * - * In the absence of zone congestion, a short sleep or a cond_resched is - * performed to yield the processor and to allow other subsystems to make - * a forward progress. + * In the absence of zone congestion, cond_resched() is called to yield + * the processor if necessary but otherwise does not sleep. * * The return value is 0 if the sleep is for the full timeout. Otherwise, * it is the number of jiffies that were still remaining when the function @@ -979,20 +978,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout) */ if (atomic_read(&nr_wb_congested[sync]) == 0 || !test_bit(ZONE_CONGESTED, &zone->flags)) { - - /* - * Memory allocation/reclaim might be called from a WQ - * context and the current implementation of the WQ - * concurrency control doesn't recognize that a particular - * WQ is congested if the worker thread is looping without - * ever sleeping. Therefore we have to do a short sleep - * here rather than calling cond_resched(). - */ - if (current->flags & PF_WQ_WORKER) - schedule_timeout_uninterruptible(1); - else - cond_resched(); - + cond_resched(); /* In case we scheduled, work out time remaining */ ret = timeout - (jiffies - start); if (ret < 0) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fa39efc..f51c302 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3436,8 +3436,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) { unsigned long available; + unsigned long reclaimable; - available = zone_reclaimable_pages(zone); + available = reclaimable = zone_reclaimable_pages(zone); available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES); available += zone_page_state_snapshot(zone, NR_FREE_PAGES); @@ -3447,9 +3448,41 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, * available? 
*/ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), - ac->high_zoneidx, alloc_flags, available)) { - /* Wait for some write requests to complete then retry */ - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50); + ac_classzone_idx(ac), alloc_flags, available)) { + /* + * If we didn't make any progress and have a lot of + * dirty + writeback pages then we should wait for + * an IO to complete to slow down the reclaim and + * prevent from pre mature OOM + */ + if (!did_some_progress) { + unsigned long writeback; + unsigned long dirty; + + writeback = zone_page_state_snapshot(zone, + NR_WRITEBACK); + dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY); + + if (2*(writeback + dirty) > reclaimable) { + congestion_wait(BLK_RW_ASYNC, HZ/10); + return true; + } + } + + /* + * Memory allocation/reclaim might be called from a WQ + * context and the current implementation of the WQ + * concurrency control doesn't recognize that + * a particular WQ is congested if the worker thread is + * looping without ever sleeping. Therefore we have to + * do a short sleep here rather than calling + * cond_resched(). + */ + if (current->flags & PF_WQ_WORKER) + schedule_timeout_uninterruptible(1); + else + cond_resched(); + return true; } } -- cgit v0.10.2 From 33c2d21438daea807947923377995c73ee8ed3fc Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:06 -0700 Subject: mm, oom: protect !costly allocations some more should_reclaim_retry will give up retries for higher order allocations if none of the eligible zones has any requested or higher order pages available even if we pass the watermak check for order-0. This is done because there is no guarantee that the reclaimable and currently free pages will form the required order. This can, however, lead to situations where the high-order request (e.g. order-2 required for the stack allocation during fork) will trigger OOM too early - e.g. after the first reclaim/compaction round. Such a system would have to be highly fragmented and there is no guarantee further reclaim/compaction attempts would help but at least make sure that the compaction was active before we go OOM and keep retrying even if should_reclaim_retry tells us to oom if - the last compaction round backed off or - we haven't completed at least MAX_COMPACT_RETRIES active compaction rounds. The first rule ensures that the very last attempt for compaction was not ignored while the second guarantees that the compaction has done some work. Multiple retries might be needed to prevent occasional pigggy backing of other contexts to steal the compacted pages before the current context manages to retry to allocate them. compaction_failed() is taken as a final word from the compaction that the retry doesn't make much sense. We have to be careful though because the first compaction round is MIGRATE_ASYNC which is rather weak as it ignores pages under writeback and gives up too easily in other situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode has been used before we give up. With this logic in place we do not have to increase the migration mode unconditionally and rather do it only if the compaction failed for the weaker mode. A nice side effect is that the stronger migration mode is used only when really needed so this has a potential of smaller latencies in some cases. 
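The retry policy described above boils down to the following shape (a simplified sketch of the should_compact_retry() added by the mm/page_alloc.c hunk below):

/* Sketch: keep retrying !costly orders until compaction has really failed. */
static bool should_compact_retry_sketch(unsigned int order,
					enum compact_result result,
					enum migrate_mode *mode,
					int retries)
{
	if (!order)
		return false;

	/* A failure only counts once the stronger sync-light mode was tried. */
	if (compaction_failed(result)) {
		if (*mode == MIGRATE_ASYNC) {
			*mode = MIGRATE_SYNC_LIGHT;
			return true;
		}
		return false;
	}

	/* Back-offs (skipped, deferred, contended) are worth another round. */
	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
		if (compaction_withdrawn(result))
			return true;
		if (retries <= MAX_COMPACT_RETRIES)
			return true;
	}

	return false;
}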
Please note that the compaction doesn't tell us much about how successful it was when returning compaction_made_progress so we just have to blindly trust that another retry is worthwhile and cap the number to something reasonable to guarantee a convergence. If the given number of successful retries is not sufficient for a reasonable workloads we should focus on the collected compaction tracepoints data and try to address the issue in the compaction code. If this is not feasible we can increase the retries limit. [mhocko@suse.com: fix warning] Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f51c302..38ad6dd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3180,6 +3180,13 @@ out: return page; } + +/* + * Maximum number of compaction retries wit a progress before OOM + * killer is consider as the only way to move forward. + */ +#define MAX_COMPACT_RETRIES 16 + #ifdef CONFIG_COMPACTION /* Try memory compaction for high-order allocations before reclaim */ static struct page * @@ -3247,14 +3254,60 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, return NULL; } + +static inline bool +should_compact_retry(unsigned int order, enum compact_result compact_result, + enum migrate_mode *migrate_mode, + int compaction_retries) +{ + if (!order) + return false; + + /* + * compaction considers all the zone as desperately out of memory + * so it doesn't really make much sense to retry except when the + * failure could be caused by weak migration mode. + */ + if (compaction_failed(compact_result)) { + if (*migrate_mode == MIGRATE_ASYNC) { + *migrate_mode = MIGRATE_SYNC_LIGHT; + return true; + } + return false; + } + + /* + * !costly allocations are really important and we have to make sure + * the compaction wasn't deferred or didn't bail out early due to locks + * contention before we go OOM. Still cap the reclaim retry loops with + * progress to prevent from looping forever and potential trashing. + */ + if (order <= PAGE_ALLOC_COSTLY_ORDER) { + if (compaction_withdrawn(compact_result)) + return true; + if (compaction_retries <= MAX_COMPACT_RETRIES) + return true; + } + + return false; +} #else static inline struct page * __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, const struct alloc_context *ac, enum migrate_mode mode, enum compact_result *compact_result) { + *compact_result = COMPACT_SKIPPED; return NULL; } + +static inline bool +should_compact_retry(unsigned int order, enum compact_result compact_result, + enum migrate_mode *migrate_mode, + int compaction_retries) +{ + return false; +} #endif /* CONFIG_COMPACTION */ /* Perform direct synchronous page reclaim */ @@ -3501,6 +3554,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned long did_some_progress; enum migrate_mode migration_mode = MIGRATE_ASYNC; enum compact_result compact_result; + int compaction_retries = 0; int no_progress_loops = 0; /* @@ -3612,13 +3666,8 @@ retry: goto nopage; } - /* - * It can become very expensive to allocate transparent hugepages at - * fault, so use asynchronous memory compaction for THP unless it is - * khugepaged trying to collapse. 
- */ - if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD)) - migration_mode = MIGRATE_SYNC_LIGHT; + if (order && compaction_made_progress(compact_result)) + compaction_retries++; /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, @@ -3649,6 +3698,17 @@ retry: no_progress_loops)) goto retry; + /* + * It doesn't make any sense to retry for the compaction if the order-0 + * reclaim is not able to make any progress because the current + * implementation of the compaction depends on the sufficient amount + * of free memory (see __compaction_suitable) + */ + if (did_some_progress > 0 && + should_compact_retry(order, compact_result, + &migration_mode, compaction_retries)) + goto retry; + /* Reclaim has failed us, start killing things */ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) @@ -3662,10 +3722,18 @@ retry: noretry: /* - * High-order allocations do not necessarily loop after - * direct reclaim and reclaim/compaction depends on compaction - * being called after reclaim so call directly if necessary + * High-order allocations do not necessarily loop after direct reclaim + * and reclaim/compaction depends on compaction being called after + * reclaim so call directly if necessary. + * It can become very expensive to allocate transparent hugepages at + * fault, so use asynchronous memory compaction for THP unless it is + * khugepaged trying to collapse. All other requests should tolerate + * at least light sync migration. */ + if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD)) + migration_mode = MIGRATE_ASYNC; + else + migration_mode = MIGRATE_SYNC_LIGHT; page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, migration_mode, &compact_result); -- cgit v0.10.2 From 7854ea6c28c6076050e24773eeb78e2925bd7411 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:09 -0700 Subject: mm: consider compaction feedback also for costly allocation PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside should_reclaim_retry currently where we decide to not retry after at least order worth of pages were reclaimed or the watermark check for at least one zone would succeed after reclaiming all pages if the reclaim hasn't made any progress. Compaction feedback is mostly ignored and we just try to make sure that the compaction did at least something before giving up. The first condition was added by a41f24ea9fd6 ("page allocator: smarter retry of costly-order allocations) and it assumed that lumpy reclaim could have created a page of the sufficient order. Lumpy reclaim, has been removed quite some time ago so the assumption doesn't hold anymore. Remove the check for the number of reclaimed pages and rely on the compaction feedback solely. should_reclaim_retry now only makes sure that we keep retrying reclaim for high order pages only if they are hidden by watermaks so order-0 reclaim makes really sense. should_compact_retry now keeps retrying even for the costly allocations. The number of retries is reduced wrt. !costly requests because they are less important and harder to grant and so their pressure shouldn't cause contention for other requests or cause an over reclaim. We also do not reset no_progress_loops for costly request to make sure we do not keep reclaiming too agressively. 
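Condensed, the costly-order part of the new policy reads as follows (a sketch of the should_compact_retry() hunk below, not the verbatim code):

/* Sketch: costly requests also honour compaction feedback, with fewer retries. */
static bool costly_compact_retry_sketch(unsigned int order,
					enum compact_result compact_result,
					int compaction_retries)
{
	int max_retries = MAX_COMPACT_RETRIES;

	/* deferred/contended/skipped compaction is still worth another round */
	if (compaction_withdrawn(compact_result))
		return true;

	/*
	 * __GFP_REPEAT costly requests may fail, so give them only a quarter
	 * of the retry budget of !costly (de facto nofail) requests.
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		max_retries /= 4;

	return compaction_retries <= max_retries;
}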
This has been tested by running a process which fragments memory: - compact memory - mmap large portion of the memory (1920M on 2GRAM machine with 2G of swapspace) - MADV_DONTNEED single page in PAGE_SIZE*((1UL< /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1 Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437 * This is the /proc/buddyinfo after the compaction Done fragmenting. size=2013265920 freed=262144000 Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0 Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0 * /proc/buddyinfo after memory got fragmented Executing "/root/alloc_hugepages.sh" Eating some pagecache 508623+0 records in 508623+0 records out 2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0 Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0 * /proc/buddyinfo after page cache got eaten Trying to allocate 129 129 * 129 hugepages requested and all of them granted. Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0 Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0 * /proc/buddyinfo after hugetlb allocation. 10 runs will behave as follows: Trying to allocate 130 130 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129 -- Trying to allocate 132 132 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129 So basically 100% success for all 10 attempts. Without the patch numbers looked much worse: Trying to allocate 128 12 -- Trying to allocate 129 14 -- Trying to allocate 129 7 -- Trying to allocate 129 16 -- Trying to allocate 129 30 -- Trying to allocate 129 38 -- Trying to allocate 129 19 -- Trying to allocate 129 37 -- Trying to allocate 129 28 -- Trying to allocate 129 37 Just for completness the base kernel without oom detection rework looks as follows: Trying to allocate 127 30 -- Trying to allocate 129 12 -- Trying to allocate 129 52 -- Trying to allocate 128 32 -- Trying to allocate 129 12 -- Trying to allocate 129 10 -- Trying to allocate 129 32 -- Trying to allocate 128 14 -- Trying to allocate 128 16 -- Trying to allocate 129 8 As we can see the success rate is much more volatile and smaller without this patch. So the patch not only makes the retry logic for costly requests more sensible the success rate is even higher. Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 38ad6dd..dea406a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3260,6 +3260,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result, enum migrate_mode *migrate_mode, int compaction_retries) { + int max_retries = MAX_COMPACT_RETRIES; + if (!order) return false; @@ -3277,17 +3279,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result, } /* - * !costly allocations are really important and we have to make sure - * the compaction wasn't deferred or didn't bail out early due to locks - * contention before we go OOM. Still cap the reclaim retry loops with - * progress to prevent from looping forever and potential trashing. 
+ * make sure the compaction wasn't deferred or didn't bail out early + * due to locks contention before we declare that we should give up. */ - if (order <= PAGE_ALLOC_COSTLY_ORDER) { - if (compaction_withdrawn(compact_result)) - return true; - if (compaction_retries <= MAX_COMPACT_RETRIES) - return true; - } + if (compaction_withdrawn(compact_result)) + return true; + + /* + * !costly requests are much more important than __GFP_REPEAT + * costly ones because they are de facto nofail and invoke OOM + * killer to move on while costly can fail and users are ready + * to cope with that. 1/4 retries is rather arbitrary but we + * would need much more detailed feedback from compaction to + * make a better decision. + */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + max_retries /= 4; + if (compaction_retries <= max_retries) + return true; return false; } @@ -3449,18 +3458,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask) * Checks whether it makes sense to retry the reclaim to make a forward progress * for the given allocation request. * The reclaim feedback represented by did_some_progress (any progress during - * the last reclaim round), pages_reclaimed (cumulative number of reclaimed - * pages) and no_progress_loops (number of reclaim rounds without any progress - * in a row) is considered as well as the reclaimable pages on the applicable - * zone list (with a backoff mechanism which is a function of no_progress_loops). + * the last reclaim round) and no_progress_loops (number of reclaim rounds without + * any progress in a row) is considered as well as the reclaimable pages on the + * applicable zone list (with a backoff mechanism which is a function of + * no_progress_loops). * * Returns true if a retry is viable or false to enter the oom path. */ static inline bool should_reclaim_retry(gfp_t gfp_mask, unsigned order, struct alloc_context *ac, int alloc_flags, - bool did_some_progress, unsigned long pages_reclaimed, - int no_progress_loops) + bool did_some_progress, int no_progress_loops) { struct zone *zone; struct zoneref *z; @@ -3472,14 +3480,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, if (no_progress_loops > MAX_RECLAIM_RETRIES) return false; - if (order > PAGE_ALLOC_COSTLY_ORDER) { - if (pages_reclaimed >= (1< PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT)) goto noretry; - if (did_some_progress) { + /* + * Costly allocations might have made a progress but this doesn't mean + * their order will become available due to high fragmentation so + * always increment the no progress counter for them + */ + if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) no_progress_loops = 0; - pages_reclaimed += did_some_progress; - } else { + else no_progress_loops++; - } if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, - did_some_progress > 0, pages_reclaimed, - no_progress_loops)) + did_some_progress > 0, no_progress_loops)) goto retry; /* -- cgit v0.10.2 From 86a294a81f93d6f36d00ec3ff779d36d218f852d Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:12 -0700 Subject: mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders "mm: consider compaction feedback also for costly allocation" has removed the upper bound for the reclaim/compaction retries based on the number of reclaimed pages for costly orders. While this is desirable the patch did miss a mis interaction between reclaim, compaction and the retry logic. 
The direct reclaim tries to get zones over min watermark while compaction backs off and returns COMPACT_SKIPPED when all zones are below low watermark + 1< Acked-by: Hillf Danton Acked-by: Vlastimil Babka Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 8d8c916..a58c852 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result) return false; } + +bool compaction_zonelist_suitable(struct alloc_context *ac, int order, + int alloc_flags); + extern int kcompactd_run(int nid); extern void kcompactd_stop(int nid); extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c60db20..8dd0333 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -739,6 +739,9 @@ static inline bool is_dev_zone(const struct zone *zone) extern struct mutex zonelists_mutex; void build_all_zonelists(pg_data_t *pgdat, struct zone *zone); void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx); +bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, + int classzone_idx, unsigned int alloc_flags, + long free_pages); bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int classzone_idx, unsigned int alloc_flags); diff --git a/mm/compaction.c b/mm/compaction.c index 4af1577..d8a20fc 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1318,7 +1318,8 @@ static enum compact_result compact_finished(struct zone *zone, */ static enum compact_result __compaction_suitable(struct zone *zone, int order, unsigned int alloc_flags, - int classzone_idx) + int classzone_idx, + unsigned long wmark_target) { int fragindex; unsigned long watermark; @@ -1341,7 +1342,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order, * allocated and for a short time, the footprint is higher */ watermark += (2UL << order); - if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags)) + if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx, + alloc_flags, wmark_target)) return COMPACT_SKIPPED; /* @@ -1368,7 +1370,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order, { enum compact_result ret; - ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx); + ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx, + zone_page_state(zone, NR_FREE_PAGES)); trace_mm_compaction_suitable(zone, order, ret); if (ret == COMPACT_NOT_SUITABLE_ZONE) ret = COMPACT_SKIPPED; @@ -1376,6 +1379,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order, return ret; } +bool compaction_zonelist_suitable(struct alloc_context *ac, int order, + int alloc_flags) +{ + struct zone *zone; + struct zoneref *z; + + /* + * Make sure at least one zone would pass __compaction_suitable if we continue + * retrying the reclaim. + */ + for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, + ac->nodemask) { + unsigned long available; + enum compact_result compact_result; + + /* + * Do not consider all the reclaimable memory because we do not + * want to trash just for a single high order allocation which + * is even not guaranteed to appear even if __compaction_suitable + * is happy about the watermark check. 
+ */ + available = zone_reclaimable_pages(zone) / order; + available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + compact_result = __compaction_suitable(zone, order, alloc_flags, + ac_classzone_idx(ac), available); + if (compact_result != COMPACT_SKIPPED && + compact_result != COMPACT_NOT_SUITABLE_ZONE) + return true; + } + + return false; +} + static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc) { enum compact_result ret; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dea406a..089f760 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2750,10 +2750,9 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) * one free page of a suitable size. Checking now avoids taking the zone lock * to check in the allocation paths if no pages are free. */ -static bool __zone_watermark_ok(struct zone *z, unsigned int order, - unsigned long mark, int classzone_idx, - unsigned int alloc_flags, - long free_pages) +bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, + int classzone_idx, unsigned int alloc_flags, + long free_pages) { long min = mark; int o; @@ -3256,8 +3255,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, } static inline bool -should_compact_retry(unsigned int order, enum compact_result compact_result, - enum migrate_mode *migrate_mode, +should_compact_retry(struct alloc_context *ac, int order, int alloc_flags, + enum compact_result compact_result, enum migrate_mode *migrate_mode, int compaction_retries) { int max_retries = MAX_COMPACT_RETRIES; @@ -3281,9 +3280,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result, /* * make sure the compaction wasn't deferred or didn't bail out early * due to locks contention before we declare that we should give up. + * But do not retry if the given zonelist is not suitable for + * compaction. */ if (compaction_withdrawn(compact_result)) - return true; + return compaction_zonelist_suitable(ac, order, alloc_flags); /* * !costly requests are much more important than __GFP_REPEAT @@ -3311,7 +3312,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, } static inline bool -should_compact_retry(unsigned int order, enum compact_result compact_result, +should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags, + enum compact_result compact_result, enum migrate_mode *migrate_mode, int compaction_retries) { @@ -3706,8 +3708,9 @@ retry: * of free memory (see __compaction_suitable) */ if (did_some_progress > 0 && - should_compact_retry(order, compact_result, - &migration_mode, compaction_retries)) + should_compact_retry(ac, order, alloc_flags, + compact_result, &migration_mode, + compaction_retries)) goto retry; /* Reclaim has failed us, start killing things */ -- cgit v0.10.2 From 31e49bfda18464b9b0cfe42044d5a4be4a0ca0f3 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:15 -0700 Subject: mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION Joonsoo has reported that he is able to trigger OOM for !costly high order requests (heavy fork() workload close the OOM) with the new oom detection rework. This is because we rely only on should_reclaim_retry when the compaction is disabled and it only checks watermarks for the requested order and so we might trigger OOM when there is a lot of free memory. It is not very clear what are the usual workloads when the compaction is disabled. 
Relying on high order allocations heavily without any mechanism to create those orders except for unbound amount of reclaim is certainly not a good idea. To prevent from potential regressions let's help this configuration some. We have to sacrifice the determinsm though because there simply is none here possible. should_compact_retry implementation for !CONFIG_COMPACTION, which was empty so far, will do watermark check for order-0 on all eligible zones. This will cause retrying until either the reclaim cannot make any further progress or all the zones are depleted even for order-0 pages. This means that the number of retries is basically unbounded for !costly orders but that was the case before the rework as well so this shouldn't regress. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org Reported-by: Joonsoo Kim Signed-off-by: Michal Hocko Acked-by: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 089f760..7e49bc3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3317,6 +3317,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla enum migrate_mode *migrate_mode, int compaction_retries) { + struct zone *zone; + struct zoneref *z; + + if (!order || order > PAGE_ALLOC_COSTLY_ORDER) + return false; + + /* + * There are setups with compaction disabled which would prefer to loop + * inside the allocator rather than hit the oom killer prematurely. + * Let's give them a good hope and keep retrying while the order-0 + * watermarks are OK. + */ + for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, + ac->nodemask) { + if (zone_watermark_ok(zone, 0, min_wmark_pages(zone), + ac_classzone_idx(ac), alloc_flags)) + return true; + } return false; } #endif /* CONFIG_COMPACTION */ -- cgit v0.10.2 From bb8a4b7fd1266ef888b3a80aa5f266062b224ef4 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:18 -0700 Subject: mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. 
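The two sides of the new flag can be summarized like this (condensed from the mm/oom_kill.c hunks below):

/* 1) oom_reaper: once the victim's address space has been reaped */
set_bit(MMF_OOM_REAPED, &mm->flags);

/* 2) oom_badness(): a reaped task is no longer an interesting victim */
adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
    test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
	task_unlock(p);
	return 0;
}

Because the flag lives in the mm rather than in signal_struct, every process sharing that mm is skipped, which is exactly what the shared-mm case described above needs.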
Signed-off-by: Michal Hocko Cc: Tetsuo Handa Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/sched.h b/include/linux/sched.h index 31bd0d9..40eabf1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -521,6 +521,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ +#define MMF_OOM_REAPED 21 /* mm has been already reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 415f7eb..c0376ef 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -174,8 +174,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, if (!p) return 0; + /* + * Do not even consider tasks which are explicitly marked oom + * unkillable or have been already oom reaped. + */ adj = (long)p->signal->oom_score_adj; - if (adj == OOM_SCORE_ADJ_MIN) { + if (adj == OOM_SCORE_ADJ_MIN || + test_bit(MMF_OOM_REAPED, &p->mm->flags)) { task_unlock(p); return 0; } @@ -513,7 +518,7 @@ static bool __oom_reap_task(struct task_struct *tsk) * This task can be safely ignored because we cannot do much more * to release its memory. */ - tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; + set_bit(MMF_OOM_REAPED, &mm->flags); out: mmput(mm); return ret; -- cgit v0.10.2 From ec8d7c14ea14922fe21945b458a75e39f11dd832 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:21 -0700 Subject: mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko Reported-by: Tetsuo Handa Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1fda9c9..d553855 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -513,6 +514,7 @@ struct mm_struct { #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; #endif + struct work_struct async_put_work; }; static inline void mm_init_cpumask(struct mm_struct *mm) diff --git a/include/linux/sched.h b/include/linux/sched.h index 40eabf1..479e3ca 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2730,6 +2730,11 @@ static inline void mmdrop(struct mm_struct * mm) /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); +/* same as above but performs the slow path from the async kontext. 
Can + * be called from the atomic context as well + */ +extern void mmput_async(struct mm_struct *); + /* Grab a reference to a task's mm, if it is not already going away */ extern struct mm_struct *get_task_mm(struct task_struct *task); /* diff --git a/kernel/fork.c b/kernel/fork.c index 3e84515..8fbed71 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -699,6 +699,26 @@ void __mmdrop(struct mm_struct *mm) } EXPORT_SYMBOL_GPL(__mmdrop); +static inline void __mmput(struct mm_struct *mm) +{ + VM_BUG_ON(atomic_read(&mm->mm_users)); + + uprobe_clear_state(mm); + exit_aio(mm); + ksm_exit(mm); + khugepaged_exit(mm); /* must run before exit_mmap */ + exit_mmap(mm); + set_mm_exe_file(mm, NULL); + if (!list_empty(&mm->mmlist)) { + spin_lock(&mmlist_lock); + list_del(&mm->mmlist); + spin_unlock(&mmlist_lock); + } + if (mm->binfmt) + module_put(mm->binfmt->module); + mmdrop(mm); +} + /* * Decrement the use count and release all resources for an mm. */ @@ -706,24 +726,24 @@ void mmput(struct mm_struct *mm) { might_sleep(); + if (atomic_dec_and_test(&mm->mm_users)) + __mmput(mm); +} +EXPORT_SYMBOL_GPL(mmput); + +static void mmput_async_fn(struct work_struct *work) +{ + struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work); + __mmput(mm); +} + +void mmput_async(struct mm_struct *mm) +{ if (atomic_dec_and_test(&mm->mm_users)) { - uprobe_clear_state(mm); - exit_aio(mm); - ksm_exit(mm); - khugepaged_exit(mm); /* must run before exit_mmap */ - exit_mmap(mm); - set_mm_exe_file(mm, NULL); - if (!list_empty(&mm->mmlist)) { - spin_lock(&mmlist_lock); - list_del(&mm->mmlist); - spin_unlock(&mmlist_lock); - } - if (mm->binfmt) - module_put(mm->binfmt->module); - mmdrop(mm); + INIT_WORK(&mm->async_put_work, mmput_async_fn); + schedule_work(&mm->async_put_work); } } -EXPORT_SYMBOL_GPL(mmput); /** * set_mm_exe_file - change a reference to the mm's executable file diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c0376ef..c0e37dd 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -446,7 +446,6 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); - static bool __oom_reap_task(struct task_struct *tsk) { struct mmu_gather tlb; @@ -520,7 +519,12 @@ static bool __oom_reap_task(struct task_struct *tsk) */ set_bit(MMF_OOM_REAPED, &mm->flags); out: - mmput(mm); + /* + * Drop our reference but make sure the mmput slow path is called from a + * different context because we shouldn't risk we get stuck there and + * put the oom_reaper out of the way. + */ + mmput_async(mm); return ret; } -- cgit v0.10.2 From 98748bd722005be9de2662bd4f7e41ad8148bdbd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 20 May 2016 16:57:24 -0700 Subject: oom: consider multi-threaded tasks in task_will_free_mem task_will_free_mem is a misnomer for a more complex PF_EXITING test for early break out from the oom killer because it is believed that such a task would release its memory shortly and so we do not have to select an oom victim and perform a disruptive action. Currently we make sure that the given task is not participating in the core dumping because it might get blocked for a long time - see commit d003f371b270 ("oom: don't assume that a coredumping thread will exit soon"). The check can still do better though. We shouldn't consider the task unless the whole thread group is going down. This is rather unlikely but not impossible. A single exiting thread would surely leave all the address space behind. 
If we are really unlucky it might get stuck on the exit path and keep its TIF_MEMDIE and so block the oom killer. Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Cc: Oleg Nesterov Cc: David Rientjes Cc: Tetsuo Handa Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/oom.h b/include/linux/oom.h index 83b9c39..d3f533f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -110,13 +110,24 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p); static inline bool task_will_free_mem(struct task_struct *task) { + struct signal_struct *sig = task->signal; + /* * A coredumping process may sleep for an extended period in exit_mm(), * so the oom killer cannot assume that the process will promptly exit * and release memory. */ - return (task->flags & PF_EXITING) && - !(task->signal->flags & SIGNAL_GROUP_COREDUMP); + if (sig->flags & SIGNAL_GROUP_COREDUMP) + return false; + + if (!(task->flags & PF_EXITING)) + return false; + + /* Make sure that the whole thread group is going down */ + if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) + return false; + + return true; } /* sysctls */ -- cgit v0.10.2 From f44666b04605d1c7fd94ab90b7ccf633e7eff228 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 20 May 2016 16:57:27 -0700 Subject: mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel panics"), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). 
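In short, the counter is maintained where TIF_MEMDIE is set and cleared, and the scan consults it once per thread group (a sketch condensed from the mm/oom_kill.c hunks below):

/* mark_oom_victim() */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
	return;
atomic_inc(&tsk->signal->oom_victims);

/* exit_oom_victim() */
if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
	return;
atomic_dec(&tsk->signal->oom_victims);

/* oom_scan_process_thread(): one per-group test instead of per-thread TIF checks */
if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
	return OOM_SCAN_ABORT;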
[mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/sched.h b/include/linux/sched.h index 479e3ca..01fe1bb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -669,6 +669,7 @@ struct signal_struct { atomic_t sigcnt; atomic_t live; int nr_threads; + atomic_t oom_victims; /* # of TIF_MEDIE threads in this thread group */ struct list_head thread_head; wait_queue_head_t wait_chldexit; /* for wait4() */ diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c0e37dd..5bb2f76 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -283,12 +283,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, * This task already has access to memory reserves and is being killed. * Don't allow any other task to have access to the reserves. */ - if (test_tsk_thread_flag(task, TIF_MEMDIE)) { - if (!is_sysrq_oom(oc)) - return OOM_SCAN_ABORT; - } - if (!task->mm) - return OOM_SCAN_CONTINUE; + if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) + return OOM_SCAN_ABORT; /* * If task is allocating a lot of memory and has been marked to be @@ -307,12 +303,12 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, static struct task_struct *select_bad_process(struct oom_control *oc, unsigned int *ppoints, unsigned long totalpages) { - struct task_struct *g, *p; + struct task_struct *p; struct task_struct *chosen = NULL; unsigned long chosen_points = 0; rcu_read_lock(); - for_each_process_thread(g, p) { + for_each_process(p) { unsigned int points; switch (oom_scan_process_thread(oc, p, totalpages)) { @@ -331,9 +327,6 @@ static struct task_struct *select_bad_process(struct oom_control *oc, points = oom_badness(p, NULL, oc->nodemask, totalpages); if (!points || points < chosen_points) continue; - /* Prefer thread group leaders for display purposes */ - if (points == chosen_points && thread_group_leader(chosen)) - continue; chosen = p; chosen_points = points; @@ -673,6 +666,7 @@ void mark_oom_victim(struct task_struct *tsk) /* OOM killer might race with memcg OOM */ if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) return; + atomic_inc(&tsk->signal->oom_victims); /* * Make sure that the task is woken up from uninterruptible sleep * if it is frozen because OOM killer wouldn't be able to free @@ -690,6 +684,7 @@ void exit_oom_victim(struct task_struct *tsk) { if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE)) return; + atomic_dec(&tsk->signal->oom_victims); if (!atomic_dec_return(&oom_victims)) wake_up_all(&oom_victims_wait); -- cgit v0.10.2 From 340a43bed674a70308f196f2a61ec0b01f8a14d9 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Fri, 20 May 2016 16:57:30 -0700 Subject: mm: thp: simplify the implementation of mk_huge_pmd() The implementation of mk_huge_pmd looks verbose, it could be just simplified to one line code. 
Signed-off-by: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 66675ee..86eafd9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -764,10 +764,7 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot) { - pmd_t entry; - entry = mk_pmd(page, prot); - entry = pmd_mkhuge(entry); - return entry; + return pmd_mkhuge(mk_pmd(page, prot)); } static inline struct list_head *page_deferred_list(struct page *page) -- cgit v0.10.2 From 495367c051fb200a42636bdc63be78ca1713a85a Mon Sep 17 00:00:00 2001 From: Chen Yucong Date: Fri, 20 May 2016 16:57:32 -0700 Subject: mm/memory-failure.c: replace "MCE" with "Memory failure" HWPoison was specific to some particular x86 platforms. And it is often seen as high level machine check handler. And therefore, 'MCE' is used for the format prefix of printk(). However, 'PowerNV' has also used HWPoison for handling memory errors[1], so 'MCE' is no longer suitable to memory_failure.c. Additionally, 'MCE' and 'Memory failure' have different context. The former belongs to exception context and the latter belongs to process context. Furthermore, HWPoison can also be used for off-lining those sub-health pages that do not trigger any machine check exception. This patch aims to replace 'MCE' with a more appropriate prefix. [1] commit 75eb3d9b60c2 ("powerpc/powernv: Get FSP memory errors and plumb into memory poison infrastructure.") Signed-off-by: Chen Yucong Acked-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memory-failure.c b/mm/memory-failure.c index ca5acee..2fcca6b 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -184,8 +184,8 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno, struct siginfo si; int ret; - pr_err("MCE %#lx: Killing %s:%d due to hardware memory corruption\n", - pfn, t->comm, t->pid); + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n", + pfn, t->comm, t->pid); si.si_signo = SIGBUS; si.si_errno = 0; si.si_addr = (void *)addr; @@ -208,7 +208,7 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno, ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */ } if (ret < 0) - pr_info("MCE: Error sending signal to %s:%d: %d\n", + pr_info("Memory failure: Error sending signal to %s:%d: %d\n", t->comm, t->pid, ret); return ret; } @@ -289,7 +289,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, } else { tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC); if (!tk) { - pr_err("MCE: Out of memory while machine check handling\n"); + pr_err("Memory failure: Out of memory while machine check handling\n"); return; } } @@ -303,7 +303,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, * a SIGKILL because the error is not contained anymore. */ if (tk->addr == -EFAULT) { - pr_info("MCE: Unable to find user space address %lx in %s\n", + pr_info("Memory failure: Unable to find user space address %lx in %s\n", page_to_pfn(p), tsk->comm); tk->addr_valid = 0; } @@ -334,7 +334,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno, * signal and then access the memory. Just kill it. 
*/ if (fail || tk->addr_valid == 0) { - pr_err("MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", + pr_err("Memory failure: %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", pfn, tk->tsk->comm, tk->tsk->pid); force_sig(SIGKILL, tk->tsk); } @@ -347,7 +347,7 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno, */ else if (kill_proc(tk->tsk, tk->addr, trapno, pfn, page, flags) < 0) - pr_err("MCE %#lx: Cannot send advisory machine check signal to %s:%d\n", + pr_err("Memory failure: %#lx: Cannot send advisory machine check signal to %s:%d\n", pfn, tk->tsk->comm, tk->tsk->pid); } put_task_struct(tk->tsk); @@ -559,7 +559,7 @@ static int me_kernel(struct page *p, unsigned long pfn) */ static int me_unknown(struct page *p, unsigned long pfn) { - pr_err("MCE %#lx: Unknown page state\n", pfn); + pr_err("Memory failure: %#lx: Unknown page state\n", pfn); return MF_FAILED; } @@ -604,11 +604,12 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn) if (mapping->a_ops->error_remove_page) { err = mapping->a_ops->error_remove_page(mapping, p); if (err != 0) { - pr_info("MCE %#lx: Failed to punch page: %d\n", + pr_info("Memory failure: %#lx: Failed to punch page: %d\n", pfn, err); } else if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) { - pr_info("MCE %#lx: failed to release buffers\n", pfn); + pr_info("Memory failure: %#lx: failed to release buffers\n", + pfn); } else { ret = MF_RECOVERED; } @@ -620,7 +621,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn) if (invalidate_inode_page(p)) ret = MF_RECOVERED; else - pr_info("MCE %#lx: Failed to invalidate\n", pfn); + pr_info("Memory failure: %#lx: Failed to invalidate\n", + pfn); } return ret; } @@ -833,7 +835,7 @@ static void action_result(unsigned long pfn, enum mf_action_page_type type, { trace_memory_failure_event(pfn, type, result); - pr_err("MCE %#lx: recovery action for %s: %s\n", + pr_err("Memory failure: %#lx: recovery action for %s: %s\n", pfn, action_page_types[type], action_name[result]); } @@ -849,7 +851,7 @@ static int page_action(struct page_state *ps, struct page *p, if (ps->action == me_swapcache_dirty && result == MF_DELAYED) count--; if (count != 0) { - pr_err("MCE %#lx: %s still referenced by %d users\n", + pr_err("Memory failure: %#lx: %s still referenced by %d users\n", pfn, action_page_types[ps->type], count); result = MF_FAILED; } @@ -882,7 +884,7 @@ int get_hwpoison_page(struct page *page) * tries to touch the "partially handled" page. 
*/ if (!PageAnon(head)) { - pr_err("MCE: %#lx: non anonymous thp\n", + pr_err("Memory failure: %#lx: non anonymous thp\n", page_to_pfn(page)); return 0; } @@ -892,7 +894,8 @@ int get_hwpoison_page(struct page *page) if (head == compound_head(page)) return 1; - pr_info("MCE: %#lx cannot catch tail\n", page_to_pfn(page)); + pr_info("Memory failure: %#lx cannot catch tail\n", + page_to_pfn(page)); put_page(head); } @@ -931,12 +934,13 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn, return SWAP_SUCCESS; if (PageKsm(p)) { - pr_err("MCE %#lx: can't handle KSM pages.\n", pfn); + pr_err("Memory failure: %#lx: can't handle KSM pages.\n", pfn); return SWAP_FAIL; } if (PageSwapCache(p)) { - pr_err("MCE %#lx: keeping poisoned page in swap cache\n", pfn); + pr_err("Memory failure: %#lx: keeping poisoned page in swap cache\n", + pfn); ttu |= TTU_IGNORE_HWPOISON; } @@ -954,7 +958,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn, } else { kill = 0; ttu |= TTU_IGNORE_HWPOISON; - pr_info("MCE %#lx: corrupted page was clean: dropped without side effects\n", + pr_info("Memory failure: %#lx: corrupted page was clean: dropped without side effects\n", pfn); } } @@ -972,7 +976,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn, ret = try_to_unmap(hpage, ttu); if (ret != SWAP_SUCCESS) - pr_err("MCE %#lx: failed to unmap page (mapcount=%d)\n", + pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n", pfn, page_mapcount(hpage)); /* @@ -1040,14 +1044,16 @@ int memory_failure(unsigned long pfn, int trapno, int flags) panic("Memory failure from trap %d on page %lx", trapno, pfn); if (!pfn_valid(pfn)) { - pr_err("MCE %#lx: memory outside kernel control\n", pfn); + pr_err("Memory failure: %#lx: memory outside kernel control\n", + pfn); return -ENXIO; } p = pfn_to_page(pfn); orig_head = hpage = compound_head(p); if (TestSetPageHWPoison(p)) { - pr_err("MCE %#lx: already hardware poisoned\n", pfn); + pr_err("Memory failure: %#lx: already hardware poisoned\n", + pfn); return 0; } @@ -1112,9 +1118,11 @@ int memory_failure(unsigned long pfn, int trapno, int flags) if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) { unlock_page(hpage); if (!PageAnon(hpage)) - pr_err("MCE: %#lx: non anonymous thp\n", pfn); + pr_err("Memory failure: %#lx: non anonymous thp\n", + pfn); else - pr_err("MCE: %#lx: thp split failed\n", pfn); + pr_err("Memory failure: %#lx: thp split failed\n", + pfn); if (TestClearPageHWPoison(p)) num_poisoned_pages_sub(nr_pages); put_hwpoison_page(p); @@ -1178,7 +1186,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags) * unpoison always clear PG_hwpoison inside page lock */ if (!PageHWPoison(p)) { - pr_err("MCE %#lx: just unpoisoned\n", pfn); + pr_err("Memory failure: %#lx: just unpoisoned\n", pfn); num_poisoned_pages_sub(nr_pages); unlock_page(hpage); put_hwpoison_page(hpage); @@ -1395,25 +1403,25 @@ int unpoison_memory(unsigned long pfn) page = compound_head(p); if (!PageHWPoison(p)) { - unpoison_pr_info("MCE: Page was already unpoisoned %#lx\n", + unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n", pfn, &unpoison_rs); return 0; } if (page_count(page) > 1) { - unpoison_pr_info("MCE: Someone grabs the hwpoison page %#lx\n", + unpoison_pr_info("Unpoison: Someone grabs the hwpoison page %#lx\n", pfn, &unpoison_rs); return 0; } if (page_mapped(page)) { - unpoison_pr_info("MCE: Someone maps the hwpoison page %#lx\n", + unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n", pfn, &unpoison_rs); 
return 0; } if (page_mapping(page)) { - unpoison_pr_info("MCE: the hwpoison page has non-NULL mapping %#lx\n", + unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n", pfn, &unpoison_rs); return 0; } @@ -1424,7 +1432,7 @@ int unpoison_memory(unsigned long pfn) * In such case, we yield to memory_failure() and make unpoison fail. */ if (!PageHuge(page) && PageTransHuge(page)) { - unpoison_pr_info("MCE: Memory failure is now running on %#lx\n", + unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n", pfn, &unpoison_rs); return 0; } @@ -1439,13 +1447,13 @@ int unpoison_memory(unsigned long pfn) * to the end. */ if (PageHuge(page)) { - unpoison_pr_info("MCE: Memory failure is now running on free hugepage %#lx\n", + unpoison_pr_info("Unpoison: Memory failure is now running on free hugepage %#lx\n", pfn, &unpoison_rs); return 0; } if (TestClearPageHWPoison(p)) num_poisoned_pages_dec(); - unpoison_pr_info("MCE: Software-unpoisoned free page %#lx\n", + unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n", pfn, &unpoison_rs); return 0; } @@ -1458,7 +1466,7 @@ int unpoison_memory(unsigned long pfn) * the free buddy page pool. */ if (TestClearPageHWPoison(page)) { - unpoison_pr_info("MCE: Software-unpoisoned page %#lx\n", + unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n", pfn, &unpoison_rs); num_poisoned_pages_sub(nr_pages); freeit = 1; -- cgit v0.10.2 From f705ac4b39f30a6a5f8411a42114758f4d4655bc Mon Sep 17 00:00:00 2001 From: Alexander Kuleshov Date: Fri, 20 May 2016 16:57:35 -0700 Subject: mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve} memblock_add_region() and memblock_reserve_region() do nothing specific before the call of memblock_add_range(), only print debug output. We can do the same in memblock_add() and memblock_reserve() since both memblock_add_region() and memblock_reserve_region() are not used by anybody outside of memblock.c and memblock_{add,reserve}() have the same set of flags and nids. Since memblock_add_region() and memblock_reserve_region() will be inlined, there will not be functional changes, but will improve code readability a little. 
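The change boils down to folding a log-only static wrapper into its single exported caller. A rough sketch of the pattern in plain C, with illustrative names only (add_range() standing in for memblock_add_range(); nothing here is the actual memblock code):

#include <stdio.h>

static int add_range(unsigned long base, unsigned long size, int nid,
		     unsigned long flags)
{
	/* common worker, stands in for memblock_add_range() */
	return 0;
}

/* The exported helper now logs and calls the worker directly. */
int block_add(unsigned long base, unsigned long size)
{
	printf("block_add: [%#lx-%#lx]\n", base, base + size - 1);
	return add_range(base, size, 0 /* nid: don't care here */, 0);
}

Behaviour is unchanged; only the extra call level and its duplicated argument plumbing go away.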
Signed-off-by: Alexander Kuleshov Acked-by: Ard Biesheuvel Cc: Mel Gorman Cc: Pekka Enberg Cc: Tony Luck Cc: Tang Chen Cc: David Gibson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memblock.c b/mm/memblock.c index b570ddd..3b93daa 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -606,22 +606,14 @@ int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size, return memblock_add_range(&memblock.memory, base, size, nid, 0); } -static int __init_memblock memblock_add_region(phys_addr_t base, - phys_addr_t size, - int nid, - unsigned long flags) +int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size) { memblock_dbg("memblock_add: [%#016llx-%#016llx] flags %#02lx %pF\n", (unsigned long long)base, (unsigned long long)base + size - 1, - flags, (void *)_RET_IP_); - - return memblock_add_range(&memblock.memory, base, size, nid, flags); -} + 0UL, (void *)_RET_IP_); -int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size) -{ - return memblock_add_region(base, size, MAX_NUMNODES, 0); + return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0); } /** @@ -732,22 +724,14 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size) return memblock_remove_range(&memblock.reserved, base, size); } -static int __init_memblock memblock_reserve_region(phys_addr_t base, - phys_addr_t size, - int nid, - unsigned long flags) +int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) { memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n", (unsigned long long)base, (unsigned long long)base + size - 1, - flags, (void *)_RET_IP_); - - return memblock_add_range(&memblock.reserved, base, size, nid, flags); -} + 0UL, (void *)_RET_IP_); -int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) -{ - return memblock_reserve_region(base, size, MAX_NUMNODES, 0); + return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0); } /** -- cgit v0.10.2 From 80c4bd7a5e4368b680e0aeb57050a1b06eb573d8 Mon Sep 17 00:00:00 2001 From: Chris Wilson Date: Fri, 20 May 2016 16:57:38 -0700 Subject: mm/vmalloc: keep a separate lazy-free list When mixing lots of vmallocs and set_memory_*() (which calls vm_unmap_aliases()) I encountered situations where the performance degraded severely due to the walking of the entire vmap_area list each invocation. One simple improvement is to add the lazily freed vmap_area to a separate lockless free list, such that we then avoid having to walk the full list on each purge. 
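The pattern is essentially a lock-free "push from anywhere, detach everything at purge time" list. A userspace sketch with C11 atomics and made-up names (lazy_free()/grab_all() loosely mirroring llist_add()/llist_del_all()), assuming nothing about the real vmap_area layout:

#include <stdatomic.h>
#include <stddef.h>

struct lazy_node {
	struct lazy_node *next;
	unsigned long start, end;	/* stand-in for a lazily freed range */
};

static _Atomic(struct lazy_node *) purge_list = NULL;

/* Push a freed node; safe to call concurrently from any thread. */
static void lazy_free(struct lazy_node *n)
{
	struct lazy_node *old = atomic_load(&purge_list);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&purge_list, &old, n));
}

/* Purger: detach the whole list in one atomic exchange, then walk only that. */
static struct lazy_node *grab_all(void)
{
	return atomic_exchange(&purge_list, NULL);
}

In the kernel, llist_add() and llist_del_all() provide the same push/detach-all semantics, which is why the purge path no longer has to scan the full vmap_area_list to find the lazily freed areas.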
Signed-off-by: Chris Wilson Reviewed-by: Roman Pen Cc: Joonas Lahtinen Cc: Tvrtko Ursulin Cc: Daniel Vetter Cc: David Rientjes Cc: Joonsoo Kim Cc: Roman Pen Cc: Mel Gorman Cc: Toshi Kani Cc: Shawn Lin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index d1f1d33..957adb7 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -4,6 +4,7 @@ #include #include #include +#include #include /* pgprot_t */ #include @@ -44,7 +45,7 @@ struct vmap_area { unsigned long flags; struct rb_node rb_node; /* address sorted rbtree */ struct list_head list; /* address sorted list */ - struct list_head purge_list; /* "lazy purge" list */ + struct llist_node purge_list; /* "lazy purge" list */ struct vm_struct *vm; struct rcu_head rcu_head; }; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ae7d20b..6e32918 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -274,13 +274,12 @@ EXPORT_SYMBOL(vmalloc_to_pfn); /*** Global kva allocator ***/ -#define VM_LAZY_FREE 0x01 -#define VM_LAZY_FREEING 0x02 #define VM_VM_AREA 0x04 static DEFINE_SPINLOCK(vmap_area_lock); /* Export for kexec only */ LIST_HEAD(vmap_area_list); +static LLIST_HEAD(vmap_purge_list); static struct rb_root vmap_area_root = RB_ROOT; /* The vmap cache globals are protected by vmap_area_lock */ @@ -601,7 +600,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, int sync, int force_flush) { static DEFINE_SPINLOCK(purge_lock); - LIST_HEAD(valist); + struct llist_node *valist; struct vmap_area *va; struct vmap_area *n_va; int nr = 0; @@ -620,20 +619,14 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, if (sync) purge_fragmented_blocks_allcpus(); - rcu_read_lock(); - list_for_each_entry_rcu(va, &vmap_area_list, list) { - if (va->flags & VM_LAZY_FREE) { - if (va->va_start < *start) - *start = va->va_start; - if (va->va_end > *end) - *end = va->va_end; - nr += (va->va_end - va->va_start) >> PAGE_SHIFT; - list_add_tail(&va->purge_list, &valist); - va->flags |= VM_LAZY_FREEING; - va->flags &= ~VM_LAZY_FREE; - } + valist = llist_del_all(&vmap_purge_list); + llist_for_each_entry(va, valist, purge_list) { + if (va->va_start < *start) + *start = va->va_start; + if (va->va_end > *end) + *end = va->va_end; + nr += (va->va_end - va->va_start) >> PAGE_SHIFT; } - rcu_read_unlock(); if (nr) atomic_sub(nr, &vmap_lazy_nr); @@ -643,7 +636,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, if (nr) { spin_lock(&vmap_area_lock); - list_for_each_entry_safe(va, n_va, &valist, purge_list) + llist_for_each_entry_safe(va, n_va, valist, purge_list) __free_vmap_area(va); spin_unlock(&vmap_area_lock); } @@ -678,9 +671,15 @@ static void purge_vmap_area_lazy(void) */ static void free_vmap_area_noflush(struct vmap_area *va) { - va->flags |= VM_LAZY_FREE; - atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr); - if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages())) + int nr_lazy; + + nr_lazy = atomic_add_return((va->va_end - va->va_start) >> PAGE_SHIFT, + &vmap_lazy_nr); + + /* After this point, we may free va at any time */ + llist_add(&va->purge_list, &vmap_purge_list); + + if (unlikely(nr_lazy > lazy_max_pages())) try_purge_vmap_area_lazy(); } -- cgit v0.10.2 From d5957d2fc232a689543bdbed1a5ff8002f0e9843 Mon Sep 17 00:00:00 2001 From: Yongji Xie Date: Fri, 20 May 2016 16:57:41 -0700 Subject: mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range() We use generic hooks in 
remap_pfn_range() to help archs to track pfnmap regions. The code is something like: int remap_pfn_range() { ... track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); ... pfn -= addr >> PAGE_SHIFT; ... untrack_pfn(vma, pfn, PAGE_ALIGN(size)); ... } Here we can easily find the pfn is changed but not recovered before untrack_pfn() is called. That's incorrect. There are no known runtime effects - this is from inspection. Signed-off-by: Yongji Xie Cc: Kirill A. Shutemov Cc: Jerome Marchand Cc: Ingo Molnar Cc: Vlastimil Babka Cc: Dave Hansen Cc: Dan Williams Cc: Matthew Wilcox Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Andy Lutomirski Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memory.c b/mm/memory.c index 07493e3..007c72a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1744,6 +1744,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long next; unsigned long end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; + unsigned long remap_pfn = pfn; int err; /* @@ -1770,7 +1771,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, vma->vm_pgoff = pfn; } - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); + err = track_pfn_remap(vma, &prot, remap_pfn, addr, PAGE_ALIGN(size)); if (err) return -EINVAL; @@ -1789,7 +1790,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, } while (pgd++, addr = next, addr != end); if (err) - untrack_pfn(vma, pfn, PAGE_ALIGN(size)); + untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size)); return err; } -- cgit v0.10.2 From f4fcd55841fc9e46daac553b39361572453c2b88 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Fri, 20 May 2016 16:57:45 -0700 Subject: mm: enable RLIMIT_DATA by default with workaround for valgrind Since commit 84638335900f ("mm: rework virtual memory accounting") RLIMIT_DATA limits both brk() and private mmap() but this's disabled by default because of incompatibility with older versions of valgrind. Valgrind always set limit to zero and fails if RLIMIT_DATA is enabled. Fortunately it changes only rlim_cur and keeps rlim_max for reverting limit back when needed. This patch checks current usage also against rlim_max if rlim_cur is zero. This is safe because task anyway can increase rlim_cur up to rlim_max. Size of brk is still checked against rlim_cur, so this part is completely compatible - zero rlim_cur forbids brk() but allows private mmap(). Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com Signed-off-by: Konstantin Khlebnikov Acked-by: Linus Torvalds Cc: Cyrill Gorcunov Cc: Christian Borntraeger Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/mmap.c b/mm/mmap.c index fba246b..b9274a0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -66,7 +66,7 @@ const int mmap_rnd_compat_bits_max = CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX; int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS; #endif -static bool ignore_rlimit_data = true; +static bool ignore_rlimit_data; core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644); static void unmap_region(struct mm_struct *mm, @@ -2886,13 +2886,17 @@ bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages) if (is_data_mapping(flags) && mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) { - if (ignore_rlimit_data) - pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. 
Will be forbidden soon.\n", + /* Workaround for Valgrind */ + if (rlimit(RLIMIT_DATA) == 0 && + mm->data_vm + npages <= rlimit_max(RLIMIT_DATA) >> PAGE_SHIFT) + return true; + if (!ignore_rlimit_data) { + pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Update limits or use boot option ignore_rlimit_data.\n", current->comm, current->pid, (mm->data_vm + npages) << PAGE_SHIFT, rlimit(RLIMIT_DATA)); - else return false; + } } return true; -- cgit v0.10.2 From 63678c32e209bd165f33432bbed72b2954ce5ae4 Mon Sep 17 00:00:00 2001 From: Rich Felker Date: Fri, 20 May 2016 16:57:47 -0700 Subject: tmpfs/ramfs: fix VM_MAYSHARE mappings for NOMMU The nommu do_mmap expects f_op->get_unmapped_area to either succeed or return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings. Returning addr in the non-MAP_SHARED case was completely wrong, and only happened to work because addr was 0. However, it prevented VM_MAYSHARE mappings from sharing backing with the fs cache, and forced such mappings (including shareable program text) to be copied whenever the number of mappings transitioned from 0 to 1, impacting performance and memory usage. Subsequent mappings beyond the first still correctly shared memory with the first. Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level; do_mmap already handles the semantic differences between them. Signed-off-by: Rich Felker Cc: Michal Hocko Cc: Greg Ungerer Cc: Geert Uytterhoeven Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c index a586467..be3ddd1 100644 --- a/fs/ramfs/file-nommu.c +++ b/fs/ramfs/file-nommu.c @@ -211,14 +211,11 @@ static unsigned long ramfs_nommu_get_unmapped_area(struct file *file, struct page **pages = NULL, **ptr, *page; loff_t isize; - if (!(flags & MAP_SHARED)) - return addr; - /* the mapping mustn't extend beyond the EOF */ lpages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; isize = i_size_read(inode); - ret = -EINVAL; + ret = -ENOSYS; maxpages = (isize + PAGE_SIZE - 1) >> PAGE_SHIFT; if (pgoff >= maxpages) goto out; @@ -227,7 +224,6 @@ static unsigned long ramfs_nommu_get_unmapped_area(struct file *file, goto out; /* gang-find the pages */ - ret = -ENOMEM; pages = kcalloc(lpages, sizeof(struct page *), GFP_KERNEL); if (!pages) goto out_free; @@ -263,7 +259,7 @@ out: */ static int ramfs_nommu_mmap(struct file *file, struct vm_area_struct *vma) { - if (!(vma->vm_flags & VM_SHARED)) + if (!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) return -ENOSYS; file_accessed(file); -- cgit v0.10.2 From 297880f4af4e492ed5084be9397d65a18ade56ee Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Fri, 20 May 2016 16:57:50 -0700 Subject: mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size The page_counter rounds limits down to page size values. This makes sense, except in the case of hugetlb_cgroup where it's not possible to charge partial hugepages. If the hugetlb_cgroup margin is less than the hugepage size being charged, it will fail as expected. Round the hugetlb_cgroup limit down to hugepage size, since it is the effective limit of the cgroup. For consistency, round down PAGE_COUNTER_MAX as well when a hugetlb_cgroup is created: this prevents error reports when a user cannot restore the value to the kernel default. Signed-off-by: David Rientjes Cc: Michal Hocko Cc: Nikolay Borisov Cc: Johannes Weiner Cc: "Kirill A. 
Shutemov" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index d8fb10d..eec1150 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -67,26 +67,42 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) return false; } +static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, + struct hugetlb_cgroup *parent_h_cgroup) +{ + int idx; + + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { + struct page_counter *counter = &h_cgroup->hugepage[idx]; + struct page_counter *parent = NULL; + unsigned long limit; + int ret; + + if (parent_h_cgroup) + parent = &parent_h_cgroup->hugepage[idx]; + page_counter_init(counter, parent); + + limit = round_down(PAGE_COUNTER_MAX, + 1 << huge_page_order(&hstates[idx])); + ret = page_counter_limit(counter, limit); + VM_BUG_ON(ret); + } +} + static struct cgroup_subsys_state * hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) { struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css); struct hugetlb_cgroup *h_cgroup; - int idx; h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL); if (!h_cgroup) return ERR_PTR(-ENOMEM); - if (parent_h_cgroup) { - for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) - page_counter_init(&h_cgroup->hugepage[idx], - &parent_h_cgroup->hugepage[idx]); - } else { + if (!parent_h_cgroup) root_h_cgroup = h_cgroup; - for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) - page_counter_init(&h_cgroup->hugepage[idx], NULL); - } + + hugetlb_cgroup_init(h_cgroup, parent_h_cgroup); return &h_cgroup->css; } @@ -285,6 +301,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, return ret; idx = MEMFILE_IDX(of_cft(of)->private); + nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx])); switch (MEMFILE_ATTR(of_cft(of)->private)) { case RES_LIMIT: -- cgit v0.10.2 From b8ca9e3a612eaf3e54c6fa136c62246a1a9aece7 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 20 May 2016 16:57:53 -0700 Subject: mm: tighten fault_in_pages_writeable() copy_page_to_iter_iovec() is currently the only user of fault_in_pages_writeable(), and it definitely can use fragments from high order pages. Make sure fault_in_pages_writeable() is only touching two adjacent pages at most, as claimed. Signed-off-by: Eric Dumazet Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index fe1513f..9735410 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -518,33 +518,27 @@ void page_endio(struct page *page, int rw, int err); extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter); /* - * Fault a userspace page into pagetables. Return non-zero on a fault. - * - * This assumes that two userspace pages are always sufficient. + * Fault one or two userspace pages into pagetables. + * Return -EINVAL if more than two pages would be needed. + * Return non-zero on a fault. */ static inline int fault_in_pages_writeable(char __user *uaddr, int size) { - int ret; + int span, ret; if (unlikely(size == 0)) return 0; + span = offset_in_page(uaddr) + size; + if (span > 2 * PAGE_SIZE) + return -EINVAL; /* * Writing zeroes into userspace here is OK, because we know that if * the zero gets there, we'll be overwriting it. */ ret = __put_user(0, uaddr); - if (ret == 0) { - char __user *end = uaddr + size - 1; - - /* - * If the page was already mapped, this will get a cache miss - * for sure, so try to avoid doing it. 
- */ - if (((unsigned long)uaddr & PAGE_MASK) != - ((unsigned long)end & PAGE_MASK)) - ret = __put_user(0, end); - } + if (ret == 0 && span > PAGE_SIZE) + ret = __put_user(0, uaddr + size - 1); return ret; } -- cgit v0.10.2 From a4a921aa5c7a1ab621e71fd1a4289e76fe230cbd Mon Sep 17 00:00:00 2001 From: Ming Li Date: Fri, 20 May 2016 16:57:56 -0700 Subject: mm/swap.c: put activate_page_pvecs and other pagevecs together Put the activate_page_pvecs definition next to those of the other pagevecs, for clarity. Signed-off-by: Ming Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/swap.c b/mm/swap.c index 03aacbc..9591614 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -47,6 +47,9 @@ static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs); static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); +#ifdef CONFIG_SMP +static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); +#endif /* * This path almost never happens for VM activity - pages are normally @@ -274,8 +277,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, } #ifdef CONFIG_SMP -static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); - static void activate_page_drain(int cpu) { struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); -- cgit v0.10.2 From 7fab358d90e6ba9d9cb702bee0c8a5f5c13bb6df Mon Sep 17 00:00:00 2001 From: Chen Gang Date: Fri, 20 May 2016 16:57:59 -0700 Subject: include/linux/hugetlb*.h: clean up code Macro HUGETLBFS_SB is clear enough, so one statement is clearer than 3 lines statements. Remove redundant return statements for non-return functions, which can save lines, at least. Signed-off-by: Chen Gang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index e44c578..7ef4b63 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -353,9 +353,7 @@ extern unsigned int default_hstate_idx; static inline struct hstate *hstate_inode(struct inode *i) { - struct hugetlbfs_sb_info *hsb; - hsb = HUGETLBFS_SB(i->i_sb); - return hsb->hstate; + return HUGETLBFS_SB(i->i_sb)->hstate; } static inline struct hstate *hstate_file(struct file *f) diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 24154c2..063962f 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -93,20 +93,17 @@ hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, struct page *page) { - return; } static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) { - return; } static inline void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg) { - return; } static inline void hugetlb_cgroup_file_init(void) @@ -116,7 +113,6 @@ static inline void hugetlb_cgroup_file_init(void) static inline void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) { - return; } #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */ -- cgit v0.10.2 From d70c17d436b3fb9dbdae8c93bf908a6110b0cb4f Mon Sep 17 00:00:00 2001 From: Chen Gang Date: Fri, 20 May 2016 16:58:01 -0700 Subject: include/linux/hugetlb.h: use bool instead of int for hugepage_migration_supported() It is used as a pure bool function within kernel source wide. 
Signed-off-by: Chen Gang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7ef4b63..c26d463 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -452,12 +452,12 @@ static inline pgoff_t basepage_index(struct page *page) extern void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn); -static inline int hugepage_migration_supported(struct hstate *h) +static inline bool hugepage_migration_supported(struct hstate *h) { #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION return huge_page_shift(h) == PMD_SHIFT; #else - return 0; + return false; #endif } @@ -519,7 +519,7 @@ static inline pgoff_t basepage_index(struct page *page) return page->index; } #define dissolve_free_huge_pages(s, e) do {} while (0) -#define hugepage_migration_supported(h) 0 +#define hugepage_migration_supported(h) false static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) -- cgit v0.10.2 From 0c9ad804f178eae02a34045bb0916fa0e31623d5 Mon Sep 17 00:00:00 2001 From: Weijie Yang Date: Fri, 20 May 2016 16:58:04 -0700 Subject: mm fix commmets: if SPARSEMEM, pgdata doesn't have page_ext If SPARSEMEM, use page_ext in mem_section if !SPARSEMEM, use page_ext in pgdata Signed-off-by: Weijie Yang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8dd0333..02069c2 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1063,7 +1063,7 @@ struct mem_section { unsigned long *pageblock_flags; #ifdef CONFIG_PAGE_EXTENSION /* - * If !SPARSEMEM, pgdat doesn't have page_ext pointer. We use + * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use * section. (see page_ext.h about this.) */ struct page_ext *page_ext; -- cgit v0.10.2 From 89474d50a0d6b9a06e043351b4e658f4f0b6ee97 Mon Sep 17 00:00:00 2001 From: Eric Engestrom Date: Fri, 20 May 2016 16:58:07 -0700 Subject: Documentation: vm: fix spelling mistakes Signed-off-by: Eric Engestrom Cc: Jonathan Corbet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index fb0e1f2..7c871d6 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -340,7 +340,7 @@ unaffected. libhugetlbfs will also work fine as usual. == Graceful fallback == -Code walking pagetables but unware about huge pmds can simply call +Code walking pagetables but unaware about huge pmds can simply call split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by pmd_offset. It's trivial to make the code transparent hugepage aware by just grepping for "pmd_offset" and adding split_huge_pmd where @@ -414,7 +414,7 @@ tracking. The alternative is alter ->_mapcount in all subpages on each map/unmap of the whole compound page. We set PG_double_map when a PMD of the page got split for the first time, -but still have PMD mapping. The addtional references go away with last +but still have PMD mapping. The additional references go away with last compound_mapcount. split_huge_page internally has to distribute the refcounts in the head @@ -432,10 +432,10 @@ page->_mapcount. We safe against physical memory scanners too: the only legitimate way scanner can get reference to a page is get_page_unless_zero(). -All tail pages has zero ->_refcount until atomic_add(). It prevent scanner -from geting reference to tail page up to the point. 
After the atomic_add() -we don't care about ->_refcount value. We already known how many references -with should uncharge from head page. +All tail pages have zero ->_refcount until atomic_add(). This prevents the +scanner from getting a reference to the tail page up to that point. After the +atomic_add() we don't care about the ->_refcount value. We already known how +many references should be uncharged from the head page. For head page get_page_unless_zero() will succeed and we don't mind. It's clear where reference should go after split: it will stay on head page. -- cgit v0.10.2 From 78ebc2f7146156f488083c9e5a7ded9d5c38c58b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Fri, 20 May 2016 16:58:10 -0700 Subject: mm,writeback: don't use memory reserves for wb_start_writeback When writeback operation cannot make forward progress because memory allocation requests needed for doing I/O cannot be satisfied (e.g. under OOM-livelock situation), we can observe flood of order-0 page allocation failure messages caused by complete depletion of memory reserves. This is caused by unconditionally allocating "struct wb_writeback_work" objects using GFP_ATOMIC from PF_MEMALLOC context. __alloc_pages_nodemask() { __alloc_pages_slowpath() { __alloc_pages_direct_reclaim() { __perform_reclaim() { current->flags |= PF_MEMALLOC; try_to_free_pages() { do_try_to_free_pages() { wakeup_flusher_threads() { wb_start_writeback() { kzalloc(sizeof(*work), GFP_ATOMIC) { /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */ } } } } } current->flags &= ~PF_MEMALLOC; } } } } Since I/O is stalling, allocating writeback requests forever shall deplete memory reserves. Fortunately, since wb_start_writeback() can fall back to wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't need to allow wb_start_writeback() to use memory reserves. Mem-Info: active_anon:289393 inactive_anon:2093 isolated_anon:29 active_file:10838 inactive_file:113013 isolated_file:859 unevictable:0 dirty:108531 writeback:5308 unstable:0 slab_reclaimable:5526 slab_unreclaimable:7077 mapped:9970 shmem:2159 pagetables:2387 bounce:0 free:3042 free_pcp:0 free_cma:0 Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes lowmem_reserve[]: 0 1732 1732 1732 Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? 
no lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 126847 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 524157 pages RAM 0 pages HighMem/MovableOnly 76348 pages reserved 0 pages hwpoisoned Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB kthreadd: page allocation failure: order:0, mode:0x2200020 file_io.00: page allocation failure: order:0, mode:0x2200020 CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: warn_alloc_failed+0xf7/0x150 __alloc_pages_nodemask+0x23f/0xa60 alloc_pages_current+0x87/0x110 new_slab+0x3a1/0x440 ___slab_alloc+0x3cf/0x590 __slab_alloc.isra.64+0x18/0x1d kmem_cache_alloc+0x11c/0x150 wb_start_writeback+0x39/0x90 wakeup_flusher_threads+0x7f/0xf0 do_try_to_free_pages+0x1f9/0x410 try_to_free_pages+0x94/0xc0 __alloc_pages_nodemask+0x566/0xa60 alloc_pages_current+0x87/0x110 __page_cache_alloc+0xaf/0xc0 pagecache_get_page+0x88/0x260 grab_cache_page_write_begin+0x21/0x40 xfs_vm_write_begin+0x2f/0xf0 generic_perform_write+0xca/0x1c0 xfs_file_buffered_aio_write+0xcc/0x1f0 xfs_file_write_iter+0x84/0x140 __vfs_write+0xc7/0x100 vfs_write+0x9d/0x190 SyS_write+0x50/0xc0 entry_SYSCALL_64_fastpath+0x12/0x6a Mem-Info: active_anon:293335 inactive_anon:2093 isolated_anon:0 active_file:10829 inactive_file:110045 isolated_file:32 unevictable:0 dirty:109275 writeback:822 unstable:0 slab_reclaimable:5489 slab_unreclaimable:10070 mapped:9999 shmem:2159 pagetables:2420 bounce:0 free:3 free_pcp:0 free_cma:0 Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes lowmem_reserve[]: 0 1732 1732 1732 Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? 
yes lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 123086 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 524157 pages RAM 0 pages HighMem/MovableOnly 76348 pages reserved 0 pages hwpoisoned SLUB: Unable to allocate memory on node -1 (gfp=0x2088020) cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0 node 0: slabs: 3218, objs: 205952, free: 0 file_io.00: page allocation failure: order:0, mode:0x2200020 CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45 Assuming that somebody will find a better solution, let's apply this patch for now to stop bleeding, for this problem frequently prevents me from testing OOM livelock condition. Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: Jan Kara Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 592cea5..989a2ce 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -931,7 +931,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, * This is WB_SYNC_NONE writeback, so if allocation fails just * wakeup the thread for old dirty data writeback */ - work = kzalloc(sizeof(*work), GFP_ATOMIC); + work = kzalloc(sizeof(*work), + GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!work) { trace_writeback_nowork(wb); wb_wakeup(wb); -- cgit v0.10.2 From 2a138dc7e50bfdc90f9db9b52584ac5564952425 Mon Sep 17 00:00:00 2001 From: Minfei Huang Date: Fri, 20 May 2016 16:58:13 -0700 Subject: mm: use existing helper to convert "on"/"off" to boolean It's more convenient to use existing function helper to convert string "on/off" to boolean. Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com Signed-off-by: Minfei Huang Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e49bc3..3f4b69a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -613,14 +613,7 @@ static int __init early_debug_pagealloc(char *buf) { if (!buf) return -EINVAL; - - if (strcmp(buf, "on") == 0) - _debug_pagealloc_enabled = true; - - if (strcmp(buf, "off") == 0) - _debug_pagealloc_enabled = false; - - return 0; + return kstrtobool(buf, &_debug_pagealloc_enabled); } early_param("debug_pagealloc", early_debug_pagealloc); diff --git a/mm/page_poison.c b/mm/page_poison.c index 479e7ea..1eae5fa 100644 --- a/mm/page_poison.c +++ b/mm/page_poison.c @@ -13,13 +13,7 @@ static int early_page_poison_param(char *buf) { if (!buf) return -EINVAL; - - if (strcmp(buf, "on") == 0) - want_page_poisoning = true; - else if (strcmp(buf, "off") == 0) - want_page_poisoning = false; - - return 0; + return strtobool(buf, &want_page_poisoning); } early_param("page_poison", early_page_poison_param); -- cgit v0.10.2 From d2a1a1f0a97a77d25cbada37161dc2ecdf01f93d Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Fri, 20 May 2016 16:58:16 -0700 Subject: mm: use unsigned long constant for page flags struct page->flags is unsigned long, so when shifting bits we should use UL suffix to match it. 
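A rough illustration of the mismatch, assuming a hypothetical flag bit above 31 and a 64-bit unsigned long (not one of the real page flags):

/* unsigned long bad_mask = 1 << 33; */	/* rejected: the shift overflows plain int */
unsigned long good_mask = 1UL << 33;	/* 0x200000000, computed in unsigned long */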
Found this problem after I added 64-bit CPU specific page flags and failed to compile the kernel: mm/page_alloc.c: In function '__free_one_page': mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow] Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com Signed-off-by: Yu Zhao Cc: "Kirill A . Shutemov" Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Jerome Marchand Cc: Denys Vlasenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a61e06e..e5a3244 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -479,7 +479,7 @@ static inline void ClearPageCompound(struct page *page) } #endif -#define PG_head_mask ((1L << PG_head)) +#define PG_head_mask ((1UL << PG_head)) #ifdef CONFIG_HUGETLB_PAGE int PageHuge(struct page *page); @@ -670,7 +670,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) } #ifdef CONFIG_MMU -#define __PG_MLOCKED (1 << PG_mlocked) +#define __PG_MLOCKED (1UL << PG_mlocked) #else #define __PG_MLOCKED 0 #endif @@ -680,11 +680,11 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * these flags set. It they are, there is a problem. */ #define PAGE_FLAGS_CHECK_AT_FREE \ - (1 << PG_lru | 1 << PG_locked | \ - 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ - 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED) + (1UL << PG_lru | 1UL << PG_locked | \ + 1UL << PG_private | 1UL << PG_private_2 | \ + 1UL << PG_writeback | 1UL << PG_reserved | \ + 1UL << PG_slab | 1UL << PG_swapcache | 1UL << PG_active | \ + 1UL << PG_unevictable | __PG_MLOCKED) /* * Flags checked when a page is prepped for return by the page allocator. @@ -695,10 +695,10 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) * alloc-free cycle to prevent from reusing the page. */ #define PAGE_FLAGS_CHECK_AT_PREP \ - (((1 << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) + (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) #define PAGE_FLAGS_PRIVATE \ - (1 << PG_private | 1 << PG_private_2) + (1UL << PG_private | 1UL << PG_private_2) /** * page_has_private - Determine if page has private stuff * @page: The page to be checked -- cgit v0.10.2 From 51038171b7ecf0017fbe3e06bd246eaa26f4d2e7 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Fri, 20 May 2016 16:58:18 -0700 Subject: memcg: fix stale mem_cgroup_force_empty() comment Commit f61c42a7d911 ("memcg: remove tasks/children test from mem_cgroup_force_empty()") removed memory reparenting from the function. Fix the function's comment. Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com Signed-off-by: Greg Thelen Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d71d387..b3f16ab 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2652,8 +2652,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) } /* - * Reclaims as many pages from the given memcg as possible and moves - * the rest to the parent. + * Reclaims as many pages from the given memcg as possible. * * Caller is responsible for holding css reference for memcg. 
*/ -- cgit v0.10.2 From 7b8da4c7f0777489f8690115b5fd7704ac0abb8f Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Fri, 20 May 2016 16:58:21 -0700 Subject: vmstat: get rid of the ugly cpu_stat_off variable The cpu_stat_off variable is unecessary since we can check if a workqueue request is pending otherwise. Removal of cpu_stat_off makes it pretty easy for the vmstat shepherd to ensure that the proper things happen. Removing the state also removes all races related to it. Should a workqueue not be scheduled as needed for vmstat_update then the shepherd will notice and schedule it as needed. Should a workqueue be unecessarily scheduled then the vmstat updater will disable it. [akpm@linux-foundation.org: fix indentation, per Michal] Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org Signed-off-by: Christoph Lameter Cc: Tejun Heo Acked-by: Michal Hocko Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/vmstat.c b/mm/vmstat.c index 5b72a8a..77e42ef 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1352,7 +1352,6 @@ static const struct file_operations proc_vmstat_file_operations = { static struct workqueue_struct *vmstat_wq; static DEFINE_PER_CPU(struct delayed_work, vmstat_work); int sysctl_stat_interval __read_mostly = HZ; -static cpumask_var_t cpu_stat_off; #ifdef CONFIG_PROC_FS static void refresh_vm_stats(struct work_struct *work) @@ -1421,24 +1420,10 @@ static void vmstat_update(struct work_struct *w) * Counters were updated so we expect more updates * to occur in the future. Keep on running the * update worker thread. - * If we were marked on cpu_stat_off clear the flag - * so that vmstat_shepherd doesn't schedule us again. */ - if (!cpumask_test_and_clear_cpu(smp_processor_id(), - cpu_stat_off)) { - queue_delayed_work_on(smp_processor_id(), vmstat_wq, + queue_delayed_work_on(smp_processor_id(), vmstat_wq, this_cpu_ptr(&vmstat_work), round_jiffies_relative(sysctl_stat_interval)); - } - } else { - /* - * We did not update any counters so the app may be in - * a mode where it does not cause counter updates. - * We may be uselessly running vmstat_update. - * Defer the checking for differentials to the - * shepherd thread on a different processor. - */ - cpumask_set_cpu(smp_processor_id(), cpu_stat_off); } } @@ -1470,16 +1455,17 @@ static bool need_update(int cpu) return false; } +/* + * Switch off vmstat processing and then fold all the remaining differentials + * until the diffs stay at zero. The function is used by NOHZ and can only be + * invoked when tick processing is not active. + */ void quiet_vmstat(void) { if (system_state != SYSTEM_RUNNING) return; - /* - * If we are already in hands of the shepherd then there - * is nothing for us to do here. 
- */ - if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off)) + if (!delayed_work_pending(this_cpu_ptr(&vmstat_work))) return; if (!need_update(smp_processor_id())) @@ -1494,7 +1480,6 @@ void quiet_vmstat(void) refresh_cpu_vm_stats(false); } - /* * Shepherd worker thread that checks the * differentials of processors that have their worker @@ -1511,20 +1496,11 @@ static void vmstat_shepherd(struct work_struct *w) get_online_cpus(); /* Check processors whose vmstat worker threads have been disabled */ - for_each_cpu(cpu, cpu_stat_off) { + for_each_online_cpu(cpu) { struct delayed_work *dw = &per_cpu(vmstat_work, cpu); - if (need_update(cpu)) { - if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) - queue_delayed_work_on(cpu, vmstat_wq, dw, 0); - } else { - /* - * Cancel the work if quiet_vmstat has put this - * cpu on cpu_stat_off because the work item might - * be still scheduled - */ - cancel_delayed_work(dw); - } + if (!delayed_work_pending(dw) && need_update(cpu)) + queue_delayed_work_on(cpu, vmstat_wq, dw, 0); } put_online_cpus(); @@ -1540,10 +1516,6 @@ static void __init start_shepherd_timer(void) INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), vmstat_update); - if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL)) - BUG(); - cpumask_copy(cpu_stat_off, cpu_online_mask); - vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0); schedule_delayed_work(&shepherd, round_jiffies_relative(sysctl_stat_interval)); @@ -1578,16 +1550,13 @@ static int vmstat_cpuup_callback(struct notifier_block *nfb, case CPU_ONLINE_FROZEN: refresh_zone_stat_thresholds(); node_set_state(cpu_to_node(cpu), N_CPU); - cpumask_set_cpu(cpu, cpu_stat_off); break; case CPU_DOWN_PREPARE: case CPU_DOWN_PREPARE_FROZEN: cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu)); - cpumask_clear_cpu(cpu, cpu_stat_off); break; case CPU_DOWN_FAILED: case CPU_DOWN_FAILED_FROZEN: - cpumask_set_cpu(cpu, cpu_stat_off); break; case CPU_DEAD: case CPU_DEAD_FROZEN: -- cgit v0.10.2 From 5f527c2b3ea261bfccb7d12f9feade924cc4987c Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Fri, 20 May 2016 16:58:24 -0700 Subject: mm: thp: microoptimize compound_mapcount() compound_mapcount() is only called after PageCompound() has already been checked by the caller, so there's no point to check it again. Gcc may optimize it away too because it's inline but this will remove the runtime check for sure and add it'll add an assert instead. Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli Acked-by: Kirill A. Shutemov Cc: Alex Williamson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index 2b97be1..65d18a4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -475,8 +475,7 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page) static inline int compound_mapcount(struct page *page) { - if (!PageCompound(page)) - return 0; + VM_BUG_ON_PAGE(!PageCompound(page), page); page = compound_head(page); return atomic_read(compound_mapcount_ptr(page)) + 1; } -- cgit v0.10.2 From d5ee7c3bcca6fe2b4f7a1fdee253250059c110d2 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Fri, 20 May 2016 16:58:27 -0700 Subject: mm: thp: split_huge_pmd_address() comment improvement Comment is partly wrong, this improves it by including the case of split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD is set. 
Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli Acked-by: Kirill A. Shutemov Cc: Alex Williamson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 86eafd9..1764184 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3023,8 +3023,10 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, return; /* - * Caller holds the mmap_sem write mode, so a huge pmd cannot - * materialize from under us. + * Caller holds the mmap_sem write mode or the anon_vma lock, + * so a huge pmd cannot materialize from under us (khugepaged + * holds both the mmap_sem write mode and the anon_vma lock + * write mode). */ __split_huge_pmd(vma, pmd, address, freeze); } -- cgit v0.10.2 From 9a001fc19cccdeb9be4c3b89ad089d92df303c44 Mon Sep 17 00:00:00 2001 From: Vitaly Wool Date: Fri, 20 May 2016 16:58:30 -0700 Subject: z3fold: the 3-fold allocator for compressed pages This patch introduces z3fold, a special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. It is a ZBUD derivative which allows for higher compression ratio keeping the simplicity and determinism of its predecessor. This patch comes as a follow-up to the discussions at the Embedded Linux Conference in San-Diego related to the talk [1]. The outcome of these discussions was that it would be good to have a compressed page allocator as stable and deterministic as zbud with with higher compression ratio. To keep the determinism and simplicity, z3fold, just like zbud, always stores an integral number of compressed pages per page, but it can store up to 3 pages unlike zbud which can store at most 2. Therefore the compression ratio goes to around 2.6x while zbud's one is around 1.7x. The patch is based on the latest linux.git tree. This version has been updated after testing on various simulators (e.g. ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on comments from Dan Streetman [3]. [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou [2] https://lkml.org/lkml/2016/4/21/799 [3] https://lkml.org/lkml/2016/5/4/852 Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com Signed-off-by: Vitaly Wool Cc: Seth Jennings Cc: Dan Streetman Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.txt new file mode 100644 index 0000000..38e4dac --- /dev/null +++ b/Documentation/vm/z3fold.txt @@ -0,0 +1,26 @@ +z3fold +------ + +z3fold is a special purpose allocator for storing compressed pages. +It is designed to store up to three compressed pages per physical page. +It is a zbud derivative which allows for higher compression +ratio keeping the simplicity and determinism of its predecessor. + +The main differences between z3fold and zbud are: +* unlike zbud, z3fold allows for up to PAGE_SIZE allocations +* z3fold can hold up to 3 compressed pages in its page +* z3fold doesn't export any API itself and is thus intended to be used + via the zpool API. + +To keep the determinism and simplicity, z3fold, just like zbud, always +stores an integral number of compressed pages per page, but it can store +up to 3 pages unlike zbud which can store at most 2. Therefore the +compression ratio goes to around 2.7x while zbud's one is around 1.7x. 
+ +Unlike zbud (but like zsmalloc for that matter) z3fold_alloc() does not +return a dereferenceable pointer. Instead, it returns an unsigned long +handle which encodes actual location of the allocated object. + +Keeping effective compression ratio close to zsmalloc's, z3fold doesn't +depend on MMU enabled and provides more predictable reclaim behavior +which makes it a better fit for small and response-critical systems. diff --git a/mm/Kconfig b/mm/Kconfig index b0432b7..1a6a28e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -567,7 +567,7 @@ config ZPOOL zsmalloc. config ZBUD - tristate "Low density storage for compressed pages" + tristate "Low (Up to 2x) density storage for compressed pages" default n help A special purpose allocator for storing compressed pages. @@ -576,6 +576,16 @@ config ZBUD deterministic reclaim properties that make it preferable to a higher density approach when reclaim will be used. +config Z3FOLD + tristate "Up to 3x density storage for compressed pages" + depends on ZPOOL + default n + help + A special purpose allocator for storing compressed pages. + It is designed to store up to three compressed pages per physical + page. It is a ZBUD derivative so the simplicity and determinism are + still there. + config ZSMALLOC tristate "Memory allocator for compressed pages" depends on MMU diff --git a/mm/Makefile b/mm/Makefile index deb467e..78c6f7d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -89,6 +89,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o obj-$(CONFIG_ZPOOL) += zpool.o obj-$(CONFIG_ZBUD) += zbud.o obj-$(CONFIG_ZSMALLOC) += zsmalloc.o +obj-$(CONFIG_Z3FOLD) += z3fold.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o diff --git a/mm/z3fold.c b/mm/z3fold.c new file mode 100644 index 0000000..34917d5 --- /dev/null +++ b/mm/z3fold.c @@ -0,0 +1,792 @@ +/* + * z3fold.c + * + * Author: Vitaly Wool + * Copyright (C) 2016, Sony Mobile Communications Inc. + * + * This implementation is based on zbud written by Seth Jennings. + * + * z3fold is an special purpose allocator for storing compressed pages. It + * can store up to three compressed pages per page which improves the + * compression ratio of zbud while retaining its main concepts (e. g. always + * storing an integral number of objects per page) and simplicity. + * It still has simple and deterministic reclaim properties that make it + * preferable to a higher density approach (with no requirement on integral + * number of object per page) when reclaim is used. + * + * As in zbud, pages are divided into "chunks". The size of the chunks is + * fixed at compile time and is determined by NCHUNKS_ORDER below. + * + * z3fold doesn't export any API and is meant to be used via zpool API. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +/***************** + * Structures +*****************/ +/* + * NCHUNKS_ORDER determines the internal allocation granularity, effectively + * adjusting internal fragmentation. It also determines the number of + * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the + * allocation granularity will be in chunks of size PAGE_SIZE/64. As one chunk + * in allocated page is occupied by z3fold header, NCHUNKS will be calculated + * to 63 which shows the max number of free chunks in z3fold page, also there + * will be 63 freelists per pool. 
+ */ +#define NCHUNKS_ORDER 6 + +#define CHUNK_SHIFT (PAGE_SHIFT - NCHUNKS_ORDER) +#define CHUNK_SIZE (1 << CHUNK_SHIFT) +#define ZHDR_SIZE_ALIGNED CHUNK_SIZE +#define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) + +#define BUDDY_MASK ((1 << NCHUNKS_ORDER) - 1) + +struct z3fold_pool; +struct z3fold_ops { + int (*evict)(struct z3fold_pool *pool, unsigned long handle); +}; + +/** + * struct z3fold_pool - stores metadata for each z3fold pool + * @lock: protects all pool fields and first|last_chunk fields of any + * z3fold page in the pool + * @unbuddied: array of lists tracking z3fold pages that contain 2- buddies; + * the lists each z3fold page is added to depends on the size of + * its free region. + * @buddied: list tracking the z3fold pages that contain 3 buddies; + * these z3fold pages are full + * @lru: list tracking the z3fold pages in LRU order by most recently + * added buddy. + * @pages_nr: number of z3fold pages in the pool. + * @ops: pointer to a structure of user defined operations specified at + * pool creation time. + * + * This structure is allocated at pool creation time and maintains metadata + * pertaining to a particular z3fold pool. + */ +struct z3fold_pool { + spinlock_t lock; + struct list_head unbuddied[NCHUNKS]; + struct list_head buddied; + struct list_head lru; + u64 pages_nr; + const struct z3fold_ops *ops; + struct zpool *zpool; + const struct zpool_ops *zpool_ops; +}; + +enum buddy { + HEADLESS = 0, + FIRST, + MIDDLE, + LAST, + BUDDIES_MAX +}; + +/* + * struct z3fold_header - z3fold page metadata occupying the first chunk of each + * z3fold page, except for HEADLESS pages + * @buddy: links the z3fold page into the relevant list in the pool + * @first_chunks: the size of the first buddy in chunks, 0 if free + * @middle_chunks: the size of the middle buddy in chunks, 0 if free + * @last_chunks: the size of the last buddy in chunks, 0 if free + * @first_num: the starting number (for the first handle) + */ +struct z3fold_header { + struct list_head buddy; + unsigned short first_chunks; + unsigned short middle_chunks; + unsigned short last_chunks; + unsigned short start_middle; + unsigned short first_num:NCHUNKS_ORDER; +}; + +/* + * Internal z3fold page flags + */ +enum z3fold_page_flags { + UNDER_RECLAIM = 0, + PAGE_HEADLESS, + MIDDLE_CHUNK_MAPPED, +}; + +/***************** + * Helpers +*****************/ + +/* Converts an allocation size in bytes to size in z3fold chunks */ +static int size_to_chunks(size_t size) +{ + return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT; +} + +#define for_each_unbuddied_list(_iter, _begin) \ + for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) + +/* Initializes the z3fold header of a newly allocated z3fold page */ +static struct z3fold_header *init_z3fold_page(struct page *page) +{ + struct z3fold_header *zhdr = page_address(page); + + INIT_LIST_HEAD(&page->lru); + clear_bit(UNDER_RECLAIM, &page->private); + clear_bit(PAGE_HEADLESS, &page->private); + clear_bit(MIDDLE_CHUNK_MAPPED, &page->private); + + zhdr->first_chunks = 0; + zhdr->middle_chunks = 0; + zhdr->last_chunks = 0; + zhdr->first_num = 0; + zhdr->start_middle = 0; + INIT_LIST_HEAD(&zhdr->buddy); + return zhdr; +} + +/* Resets the struct page fields and frees the page */ +static void free_z3fold_page(struct z3fold_header *zhdr) +{ + __free_page(virt_to_page(zhdr)); +} + +/* + * Encodes the handle of a particular buddy within a z3fold page + * Pool lock should be held as this function accesses first_num + */ +static unsigned long encode_handle(struct z3fold_header 
*zhdr, enum buddy bud) +{ + unsigned long handle; + + handle = (unsigned long)zhdr; + if (bud != HEADLESS) + handle += (bud + zhdr->first_num) & BUDDY_MASK; + return handle; +} + +/* Returns the z3fold page where a given handle is stored */ +static struct z3fold_header *handle_to_z3fold_header(unsigned long handle) +{ + return (struct z3fold_header *)(handle & PAGE_MASK); +} + +/* Returns buddy number */ +static enum buddy handle_to_buddy(unsigned long handle) +{ + struct z3fold_header *zhdr = handle_to_z3fold_header(handle); + return (handle - zhdr->first_num) & BUDDY_MASK; +} + +/* + * Returns the number of free chunks in a z3fold page. + * NB: can't be used with HEADLESS pages. + */ +static int num_free_chunks(struct z3fold_header *zhdr) +{ + int nfree; + /* + * If there is a middle object, pick up the bigger free space + * either before or after it. Otherwise just subtract the number + * of chunks occupied by the first and the last objects. + */ + if (zhdr->middle_chunks != 0) { + int nfree_before = zhdr->first_chunks ? + 0 : zhdr->start_middle - 1; + int nfree_after = zhdr->last_chunks ? + 0 : NCHUNKS - zhdr->start_middle - zhdr->middle_chunks; + nfree = max(nfree_before, nfree_after); + } else + nfree = NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; + return nfree; +} + +/***************** + * API Functions +*****************/ +/** + * z3fold_create_pool() - create a new z3fold pool + * @gfp: gfp flags when allocating the z3fold pool structure + * @ops: user-defined operations for the z3fold pool + * + * Return: pointer to the new z3fold pool or NULL if the metadata allocation + * failed. + */ +static struct z3fold_pool *z3fold_create_pool(gfp_t gfp, + const struct z3fold_ops *ops) +{ + struct z3fold_pool *pool; + int i; + + pool = kzalloc(sizeof(struct z3fold_pool), gfp); + if (!pool) + return NULL; + spin_lock_init(&pool->lock); + for_each_unbuddied_list(i, 0) + INIT_LIST_HEAD(&pool->unbuddied[i]); + INIT_LIST_HEAD(&pool->buddied); + INIT_LIST_HEAD(&pool->lru); + pool->pages_nr = 0; + pool->ops = ops; + return pool; +} + +/** + * z3fold_destroy_pool() - destroys an existing z3fold pool + * @pool: the z3fold pool to be destroyed + * + * The pool should be emptied before this function is called. + */ +static void z3fold_destroy_pool(struct z3fold_pool *pool) +{ + kfree(pool); +} + +/* Has to be called with lock held */ +static int z3fold_compact_page(struct z3fold_header *zhdr) +{ + struct page *page = virt_to_page(zhdr); + void *beg = zhdr; + + + if (!test_bit(MIDDLE_CHUNK_MAPPED, &page->private) && + zhdr->middle_chunks != 0 && + zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { + memmove(beg + ZHDR_SIZE_ALIGNED, + beg + (zhdr->start_middle << CHUNK_SHIFT), + zhdr->middle_chunks << CHUNK_SHIFT); + zhdr->first_chunks = zhdr->middle_chunks; + zhdr->middle_chunks = 0; + zhdr->start_middle = 0; + zhdr->first_num++; + return 1; + } + return 0; +} + +/** + * z3fold_alloc() - allocates a region of a given size + * @pool: z3fold pool from which to allocate + * @size: size in bytes of the desired allocation + * @gfp: gfp flags used if the pool needs to grow + * @handle: handle of the new allocation + * + * This function will attempt to find a free region in the pool large enough to + * satisfy the allocation request. A search of the unbuddied lists is + * performed first. If no suitable free region is found, then a new page is + * allocated and added to the pool to satisfy the request. + * + * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used + * as z3fold pool pages. 
+ * + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or + * gfp arguments are invalid, -ENOSPC if the size exceeds PAGE_SIZE, or + * -ENOMEM if the pool was unable to allocate a new page. + */ +static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp, + unsigned long *handle) +{ + int chunks = 0, i, freechunks; + struct z3fold_header *zhdr = NULL; + enum buddy bud; + struct page *page; + + if (!size || (gfp & __GFP_HIGHMEM)) + return -EINVAL; + + if (size > PAGE_SIZE) + return -ENOSPC; + + if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) + bud = HEADLESS; + else { + chunks = size_to_chunks(size); + spin_lock(&pool->lock); + + /* First, try to find an unbuddied z3fold page. */ + zhdr = NULL; + for_each_unbuddied_list(i, chunks) { + if (!list_empty(&pool->unbuddied[i])) { + zhdr = list_first_entry(&pool->unbuddied[i], + struct z3fold_header, buddy); + page = virt_to_page(zhdr); + if (zhdr->first_chunks == 0) { + if (zhdr->middle_chunks != 0 && + chunks >= zhdr->start_middle) + bud = LAST; + else + bud = FIRST; + } else if (zhdr->last_chunks == 0) + bud = LAST; + else if (zhdr->middle_chunks == 0) + bud = MIDDLE; + else { + pr_err("No free chunks in unbuddied\n"); + WARN_ON(1); + continue; + } + list_del(&zhdr->buddy); + goto found; + } + } + bud = FIRST; + spin_unlock(&pool->lock); + } + + /* Couldn't find unbuddied z3fold page, create new one */ + page = alloc_page(gfp); + if (!page) + return -ENOMEM; + spin_lock(&pool->lock); + pool->pages_nr++; + zhdr = init_z3fold_page(page); + + if (bud == HEADLESS) { + set_bit(PAGE_HEADLESS, &page->private); + goto headless; + } + +found: + if (bud == FIRST) + zhdr->first_chunks = chunks; + else if (bud == LAST) + zhdr->last_chunks = chunks; + else { + zhdr->middle_chunks = chunks; + zhdr->start_middle = zhdr->first_chunks + 1; + } + + if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || + zhdr->middle_chunks == 0) { + /* Add to unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } else { + /* Add to buddied list */ + list_add(&zhdr->buddy, &pool->buddied); + } + +headless: + /* Add/move z3fold page to beginning of LRU */ + if (!list_empty(&page->lru)) + list_del(&page->lru); + + list_add(&page->lru, &pool->lru); + + *handle = encode_handle(zhdr, bud); + spin_unlock(&pool->lock); + + return 0; +} + +/** + * z3fold_free() - frees the allocation associated with the given handle + * @pool: pool in which the allocation resided + * @handle: handle associated with the allocation returned by z3fold_alloc() + * + * In the case that the z3fold page in which the allocation resides is under + * reclaim, as indicated by the UNDER_RECLAIM flag being set, this function + * only sets the corresponding first|middle|last_chunks field to 0. The page + * is actually freed once all buddies are evicted (see z3fold_reclaim_page() + * below).
+ */ +static void z3fold_free(struct z3fold_pool *pool, unsigned long handle) +{ + struct z3fold_header *zhdr; + int freechunks; + struct page *page; + enum buddy bud; + + spin_lock(&pool->lock); + zhdr = handle_to_z3fold_header(handle); + page = virt_to_page(zhdr); + + if (test_bit(PAGE_HEADLESS, &page->private)) { + /* HEADLESS page stored */ + bud = HEADLESS; + } else { + bud = (handle - zhdr->first_num) & BUDDY_MASK; + + switch (bud) { + case FIRST: + zhdr->first_chunks = 0; + break; + case MIDDLE: + zhdr->middle_chunks = 0; + zhdr->start_middle = 0; + break; + case LAST: + zhdr->last_chunks = 0; + break; + default: + pr_err("%s: unknown bud %d\n", __func__, bud); + WARN_ON(1); + spin_unlock(&pool->lock); + return; + } + } + + if (test_bit(UNDER_RECLAIM, &page->private)) { + /* z3fold page is under reclaim, reclaim will free */ + spin_unlock(&pool->lock); + return; + } + + if (bud != HEADLESS) { + /* Remove from existing buddy list */ + list_del(&zhdr->buddy); + } + + if (bud == HEADLESS || + (zhdr->first_chunks == 0 && zhdr->middle_chunks == 0 && + zhdr->last_chunks == 0)) { + /* z3fold page is empty, free */ + list_del(&page->lru); + clear_bit(PAGE_HEADLESS, &page->private); + free_z3fold_page(zhdr); + pool->pages_nr--; + } else { + z3fold_compact_page(zhdr); + /* Add to the unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } + + spin_unlock(&pool->lock); +} + +/** + * z3fold_reclaim_page() - evicts allocations from a pool page and frees it + * @pool: pool from which a page will attempt to be evicted + * @retries: number of pages on the LRU list for which eviction will + * be attempted before failing + * + * z3fold reclaim is different from normal system reclaim in that it is done + * from the bottom, up. This is because only the bottom layer, z3fold, has + * information on how the allocations are organized within each z3fold page. + * This has the potential to create interesting locking situations between + * z3fold and the user, however. + * + * To avoid these, this is how z3fold_reclaim_page() should be called: + * + * The user detects a page should be reclaimed and calls z3fold_reclaim_page(). + * z3fold_reclaim_page() will remove a z3fold page from the pool LRU list and + * call the user-defined eviction handler with the pool and handle as + * arguments. + * + * If the handle can not be evicted, the eviction handler should return + * non-zero. z3fold_reclaim_page() will add the z3fold page back to the + * appropriate list and try the next z3fold page on the LRU up to + * a user defined number of retries. + * + * If the handle is successfully evicted, the eviction handler should + * return 0 _and_ should have called z3fold_free() on the handle. z3fold_free() + * contains logic to delay freeing the page if the page is under reclaim, + * as indicated by the UNDER_RECLAIM flag being set on the underlying page. + * + * If all buddies in the z3fold page are successfully evicted, then the + * z3fold page can be freed. + * + * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are + * no pages to evict or an eviction handler is not registered, -EAGAIN if + * the retry limit was hit.
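+ *
+ * A minimal eviction handler following this protocol might look like the
+ * sketch below, where my_evict() and my_writeback() are hypothetical names
+ * (illustrative only, not part of this patch); my_writeback() is assumed to
+ * push the handle's data out to its backing store:
+ *
+ *	static int my_evict(struct z3fold_pool *pool, unsigned long handle)
+ *	{
+ *		if (my_writeback(pool, handle))
+ *			return -EAGAIN;
+ *		z3fold_free(pool, handle);
+ *		return 0;
+ *	}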
+ */ +static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) +{ + int i, ret = 0, freechunks; + struct z3fold_header *zhdr; + struct page *page; + unsigned long first_handle = 0, middle_handle = 0, last_handle = 0; + + spin_lock(&pool->lock); + if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) || + retries == 0) { + spin_unlock(&pool->lock); + return -EINVAL; + } + for (i = 0; i < retries; i++) { + page = list_last_entry(&pool->lru, struct page, lru); + list_del(&page->lru); + + /* Protect z3fold page against free */ + set_bit(UNDER_RECLAIM, &page->private); + zhdr = page_address(page); + if (!test_bit(PAGE_HEADLESS, &page->private)) { + list_del(&zhdr->buddy); + /* + * We need to encode the handles before unlocking, since + * we can race with free that will set + * (first|last)_chunks to 0 + */ + first_handle = 0; + last_handle = 0; + middle_handle = 0; + if (zhdr->first_chunks) + first_handle = encode_handle(zhdr, FIRST); + if (zhdr->middle_chunks) + middle_handle = encode_handle(zhdr, MIDDLE); + if (zhdr->last_chunks) + last_handle = encode_handle(zhdr, LAST); + } else { + first_handle = encode_handle(zhdr, HEADLESS); + last_handle = middle_handle = 0; + } + + spin_unlock(&pool->lock); + + /* Issue the eviction callback(s) */ + if (middle_handle) { + ret = pool->ops->evict(pool, middle_handle); + if (ret) + goto next; + } + if (first_handle) { + ret = pool->ops->evict(pool, first_handle); + if (ret) + goto next; + } + if (last_handle) { + ret = pool->ops->evict(pool, last_handle); + if (ret) + goto next; + } +next: + spin_lock(&pool->lock); + clear_bit(UNDER_RECLAIM, &page->private); + if ((test_bit(PAGE_HEADLESS, &page->private) && ret == 0) || + (zhdr->first_chunks == 0 && zhdr->last_chunks == 0 && + zhdr->middle_chunks == 0)) { + /* + * All buddies are now free, free the z3fold page and + * return success. + */ + clear_bit(PAGE_HEADLESS, &page->private); + free_z3fold_page(zhdr); + pool->pages_nr--; + spin_unlock(&pool->lock); + return 0; + } else if (zhdr->first_chunks != 0 && + zhdr->last_chunks != 0 && zhdr->middle_chunks != 0) { + /* Full, add to buddied list */ + list_add(&zhdr->buddy, &pool->buddied); + } else if (!test_bit(PAGE_HEADLESS, &page->private)) { + z3fold_compact_page(zhdr); + /* add to unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } + + /* add to beginning of LRU */ + list_add(&page->lru, &pool->lru); + } + spin_unlock(&pool->lock); + return -EAGAIN; +} + +/** + * z3fold_map() - maps the allocation associated with the given handle + * @pool: pool in which the allocation resides + * @handle: handle associated with the allocation to be mapped + * + * Extracts the buddy number from handle and constructs the pointer to the + * correct starting chunk within the page.
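+ *
+ * For example, assuming 4K pages (so CHUNK_SHIFT is 6 and CHUNK_SIZE is 64):
+ * a FIRST buddy always maps at offset ZHDR_SIZE_ALIGNED = 64, right after
+ * the header, while a LAST buddy of 10 chunks maps at
+ * PAGE_SIZE - (10 << CHUNK_SHIFT) = 4096 - 640 = 3456.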
+ * + * Returns: a pointer to the mapped allocation + */ +static void *z3fold_map(struct z3fold_pool *pool, unsigned long handle) +{ + struct z3fold_header *zhdr; + struct page *page; + void *addr; + enum buddy buddy; + + spin_lock(&pool->lock); + zhdr = handle_to_z3fold_header(handle); + addr = zhdr; + page = virt_to_page(zhdr); + + if (test_bit(PAGE_HEADLESS, &page->private)) + goto out; + + buddy = handle_to_buddy(handle); + switch (buddy) { + case FIRST: + addr += ZHDR_SIZE_ALIGNED; + break; + case MIDDLE: + addr += zhdr->start_middle << CHUNK_SHIFT; + set_bit(MIDDLE_CHUNK_MAPPED, &page->private); + break; + case LAST: + addr += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); + break; + default: + pr_err("unknown buddy id %d\n", buddy); + WARN_ON(1); + addr = NULL; + break; + } +out: + spin_unlock(&pool->lock); + return addr; +} + +/** + * z3fold_unmap() - unmaps the allocation associated with the given handle + * @pool: pool in which the allocation resides + * @handle: handle associated with the allocation to be unmapped + */ +static void z3fold_unmap(struct z3fold_pool *pool, unsigned long handle) +{ + struct z3fold_header *zhdr; + struct page *page; + enum buddy buddy; + + spin_lock(&pool->lock); + zhdr = handle_to_z3fold_header(handle); + page = virt_to_page(zhdr); + + if (test_bit(PAGE_HEADLESS, &page->private)) { + spin_unlock(&pool->lock); + return; + } + + buddy = handle_to_buddy(handle); + if (buddy == MIDDLE) + clear_bit(MIDDLE_CHUNK_MAPPED, &page->private); + spin_unlock(&pool->lock); +} + +/** + * z3fold_get_pool_size() - gets the z3fold pool size in pages + * @pool: pool whose size is being queried + * + * Returns: size in pages of the given pool. The pool lock need not be + * taken to access pages_nr. + */ +static u64 z3fold_get_pool_size(struct z3fold_pool *pool) +{ + return pool->pages_nr; +} + +/***************** + * zpool + ****************/ + +static int z3fold_zpool_evict(struct z3fold_pool *pool, unsigned long handle) +{ + if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict) + return pool->zpool_ops->evict(pool->zpool, handle); + else + return -ENOENT; +} + +static const struct z3fold_ops z3fold_zpool_ops = { + .evict = z3fold_zpool_evict +}; + +static void *z3fold_zpool_create(const char *name, gfp_t gfp, + const struct zpool_ops *zpool_ops, + struct zpool *zpool) +{ + struct z3fold_pool *pool; + + pool = z3fold_create_pool(gfp, zpool_ops ? 
&z3fold_zpool_ops : NULL); + if (pool) { + pool->zpool = zpool; + pool->zpool_ops = zpool_ops; + } + return pool; +} + +static void z3fold_zpool_destroy(void *pool) +{ + z3fold_destroy_pool(pool); +} + +static int z3fold_zpool_malloc(void *pool, size_t size, gfp_t gfp, + unsigned long *handle) +{ + return z3fold_alloc(pool, size, gfp, handle); +} + +static void z3fold_zpool_free(void *pool, unsigned long handle) +{ + z3fold_free(pool, handle); +} + +static int z3fold_zpool_shrink(void *pool, unsigned int pages, + unsigned int *reclaimed) +{ + unsigned int total = 0; + int ret = -EINVAL; + + while (total < pages) { + ret = z3fold_reclaim_page(pool, 8); + if (ret < 0) + break; + total++; + } + + if (reclaimed) + *reclaimed = total; + + return ret; +} + +static void *z3fold_zpool_map(void *pool, unsigned long handle, + enum zpool_mapmode mm) +{ + return z3fold_map(pool, handle); +} + +static void z3fold_zpool_unmap(void *pool, unsigned long handle) +{ + z3fold_unmap(pool, handle); +} + +static u64 z3fold_zpool_total_size(void *pool) +{ + return z3fold_get_pool_size(pool) * PAGE_SIZE; +} + +static struct zpool_driver z3fold_zpool_driver = { + .type = "z3fold", + .owner = THIS_MODULE, + .create = z3fold_zpool_create, + .destroy = z3fold_zpool_destroy, + .malloc = z3fold_zpool_malloc, + .free = z3fold_zpool_free, + .shrink = z3fold_zpool_shrink, + .map = z3fold_zpool_map, + .unmap = z3fold_zpool_unmap, + .total_size = z3fold_zpool_total_size, +}; + +MODULE_ALIAS("zpool-z3fold"); + +static int __init init_z3fold(void) +{ + /* Make sure the z3fold header will fit in one chunk */ + BUILD_BUG_ON(sizeof(struct z3fold_header) > ZHDR_SIZE_ALIGNED); + zpool_register_driver(&z3fold_zpool_driver); + + return 0; +} + +static void __exit exit_z3fold(void) +{ + zpool_unregister_driver(&z3fold_zpool_driver); +} + +module_init(init_z3fold); +module_exit(exit_z3fold); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Vitaly Wool "); +MODULE_DESCRIPTION("3-Fold Allocator for Compressed Pages"); -- cgit v0.10.2 From cd33a76b0f2805fb0d6a05a2d53933f3817ccc9b Mon Sep 17 00:00:00 2001 From: Richard Leitner Date: Fri, 20 May 2016 16:58:33 -0700 Subject: mm/memblock.c: remove unnecessary always-true comparison Comparing a u64 variable to >= 0 always returns true, so the comparison can be removed. This issue was detected using the -Wtype-limits gcc flag.
This patch fixes following type-limits warning: mm/memblock.c: In function `__next_reserved_mem_region': mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if (*idx >= 0 && *idx < type->cnt) { Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net Signed-off-by: Richard Leitner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memblock.c b/mm/memblock.c index 3b93daa..ac12489 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -824,7 +824,7 @@ void __init_memblock __next_reserved_mem_region(u64 *idx, { struct memblock_type *type = &memblock.reserved; - if (*idx >= 0 && *idx < type->cnt) { + if (*idx < type->cnt) { struct memblock_region *r = &type->regions[*idx]; phys_addr_t base = r->base; phys_addr_t size = r->size; -- cgit v0.10.2 From d2005e3f41d4f9299e2df6a967c8beb5086967a9 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 20 May 2016 16:58:36 -0700 Subject: userfaultfd: don't pin the user memory in userfaultfd_file_create() userfaultfd_file_create() increments mm->mm_users; this means that the memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY after that can populate the orphaned mm more. Change userfaultfd_file_create() and userfaultfd_ctx_put() to use mm->mm_count to pin mm_struct. This means that atomic_inc_not_zero(mm->mm_users) is needed when we are going to actually play with this memory. Except handle_userfault() path doesn't need this, the caller must already have a reference. The patch adds the new trivial helper, mmget_not_zero(), it can have more users. Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com Signed-off-by: Oleg Nesterov Cc: Andrea Arcangeli Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 66cdb44..2d97952 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -137,7 +137,7 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx) VM_BUG_ON(waitqueue_active(&ctx->fault_wqh)); VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock)); VM_BUG_ON(waitqueue_active(&ctx->fd_wqh)); - mmput(ctx->mm); + mmdrop(ctx->mm); kmem_cache_free(userfaultfd_ctx_cachep, ctx); } } @@ -434,6 +434,9 @@ static int userfaultfd_release(struct inode *inode, struct file *file) ACCESS_ONCE(ctx->released) = true; + if (!mmget_not_zero(mm)) + goto wakeup; + /* * Flush page faults out of all CPUs. 
NOTE: all page faults * must be retried without returning VM_FAULT_SIGBUS if @@ -466,7 +469,8 @@ static int userfaultfd_release(struct inode *inode, struct file *file) vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; } up_write(&mm->mmap_sem); - + mmput(mm); +wakeup: /* * After no new page faults can wait on this fault_*wqh, flush * the last page faults that may have been already waiting on @@ -760,10 +764,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, start = uffdio_register.range.start; end = start + uffdio_register.range.len; + ret = -ENOMEM; + if (!mmget_not_zero(mm)) + goto out; + down_write(&mm->mmap_sem); vma = find_vma_prev(mm, start, &prev); - - ret = -ENOMEM; if (!vma) goto out_unlock; @@ -864,6 +870,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, } while (vma && vma->vm_start < end); out_unlock: up_write(&mm->mmap_sem); + mmput(mm); if (!ret) { /* * Now that we scanned all vmas we can already tell @@ -902,10 +909,12 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, start = uffdio_unregister.start; end = start + uffdio_unregister.len; + ret = -ENOMEM; + if (!mmget_not_zero(mm)) + goto out; + down_write(&mm->mmap_sem); vma = find_vma_prev(mm, start, &prev); - - ret = -ENOMEM; if (!vma) goto out_unlock; @@ -998,6 +1007,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, } while (vma && vma->vm_start < end); out_unlock: up_write(&mm->mmap_sem); + mmput(mm); out: return ret; } @@ -1067,9 +1077,11 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, goto out; if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE) goto out; - - ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, - uffdio_copy.len); + if (mmget_not_zero(ctx->mm)) { + ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, + uffdio_copy.len); + mmput(ctx->mm); + } if (unlikely(put_user(ret, &user_uffdio_copy->copy))) return -EFAULT; if (ret < 0) @@ -1110,8 +1122,11 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx, if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE) goto out; - ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start, - uffdio_zeropage.range.len); + if (mmget_not_zero(ctx->mm)) { + ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start, + uffdio_zeropage.range.len); + mmput(ctx->mm); + } if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage))) return -EFAULT; if (ret < 0) @@ -1289,12 +1304,12 @@ static struct file *userfaultfd_file_create(int flags) ctx->released = false; ctx->mm = current->mm; /* prevent the mm struct to be freed */ - atomic_inc(&ctx->mm->mm_users); + atomic_inc(&ctx->mm->mm_count); file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx, O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS)); if (IS_ERR(file)) { - mmput(ctx->mm); + mmdrop(ctx->mm); kmem_cache_free(userfaultfd_ctx_cachep, ctx); } out: diff --git a/include/linux/sched.h b/include/linux/sched.h index 01fe1bb..6b3213d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2723,12 +2723,17 @@ extern struct mm_struct * mm_alloc(void); /* mmdrop drops the mm and the page tables */ extern void __mmdrop(struct mm_struct *); -static inline void mmdrop(struct mm_struct * mm) +static inline void mmdrop(struct mm_struct *mm) { if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } +static inline bool mmget_not_zero(struct mm_struct *mm) +{ + return atomic_inc_not_zero(&mm->mm_users); +} + /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); /* same as above 
but performs the slow path from the async context. Can -- cgit v0.10.2 From 4b50bcc7eda4d3cc9e3f2a0aa60e590fedf728c5 Mon Sep 17 00:00:00 2001 From: Stefan Bader Date: Fri, 20 May 2016 16:58:38 -0700 Subject: mm: use phys_addr_t for reserve_bootmem_region() arguments Since commit 92923ca3aace ("mm: meminit: only set page reserved in the memblock region") the reserved bit is set on reserved memblock regions. However, the start and end addresses are passed as unsigned long. This is only 32bit on i386, so it can end up marking the wrong pages reserved for ranges at 4GB and above. This was observed on a 32bit Xen dom0 which was booted with initial memory set to a value below 4G but allowed to balloon in more memory (dom0_mem=1024M for example). This would define a reserved bootmem region for the additional memory (for example, on an 8GB system there was a reserved region covering the 4GB-8GB range). But since the addresses were passed on as unsigned long, this was actually marking all pages from 0 to 4GB as reserved. Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region") Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com Signed-off-by: Stefan Bader Cc: [4.2+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index 65d18a4..fbdb9d4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1763,7 +1763,7 @@ extern void free_highmem_page(struct page *page); extern void adjust_managed_page_count(struct page *page, long count); extern void mem_init_print_info(const char *str); -extern void reserve_bootmem_region(unsigned long start, unsigned long end); +extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end); /* Free the reserved page into the buddy system, so it gets managed. */ static inline void __free_reserved_page(struct page *page) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3f4b69a..2dd1ba4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1205,7 +1205,7 @@ static inline void init_reserved_page(unsigned long pfn) * marks the pages PageReserved. The remaining valid pages are later * sent to the buddy page allocator. */ -void __meminit reserve_bootmem_region(unsigned long start, unsigned long end) +void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); -- cgit v0.10.2 From 5c0a85fad949212b3e059692deecdeed74ae7ec7 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 20 May 2016 16:58:41 -0700 Subject: mm: make faultaround produce old ptes Currently, the faultaround code produces young ptes. This can screw up vmscan behaviour[1], as it makes vmscan think that these pages are hot and not push them out on the first round. During sparse file access faultaround gets more pages mapped and all of them are young. Under memory pressure, this makes vmscan swap out anon pages instead, or drop other page cache pages which would otherwise stay resident. Modify faultaround to produce old ptes, so they can easily be reclaimed under memory pressure. This can to some extent defeat the purpose of faultaround on machines without a hardware accessed bit, as it will not help us with reducing the number of minor page faults. We may want to disable faultaround on such machines altogether, but that's a subject for a separate patchset.
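For illustration only (editor's sketch, not part of the patch; the actual change threads an extra "old" flag through do_set_pte(), as the diff below shows), the idea boils down to:

	pte_t entry = mk_pte(page, vma->vm_page_prot);

	if (speculative)
		entry = pte_mkold(entry);
	set_pte_at(vma->vm_mm, address, pte, entry);

where "speculative" is a hypothetical flag meaning "mapped by faultaround rather than by the fault itself"; clearing the accessed bit lets vmscan treat such pages as cold until they are actually used.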
Minchan: "I tested 512M mmap sequential word read test on non-HW access bit system (i.e., ARM) and confirmed it doesn't increase minor fault any more. old: 4096 fault_around minor fault: 131291 elapsed time: 6747645 usec new: 65536 fault_around minor fault: 131291 elapsed time: 6709263 usec 0.56% benefit" [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Acked-by: Michal Hocko Acked-by: Minchan Kim Tested-by: Minchan Kim Acked-by: Rik van Riel Cc: Mel Gorman Cc: Michal Hocko Cc: Vinayak Menon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index fbdb9d4..f223ac2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -596,7 +596,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) } void do_set_pte(struct vm_area_struct *vma, unsigned long address, - struct page *page, pte_t *pte, bool write, bool anon); + struct page *page, pte_t *pte, bool write, bool anon, bool old); #endif /* diff --git a/mm/filemap.c b/mm/filemap.c index 8f48599..b418405 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2191,7 +2191,7 @@ repeat: if (file->f_ra.mmap_miss > 0) file->f_ra.mmap_miss--; addr = address + (page->index - vmf->pgoff) * PAGE_SIZE; - do_set_pte(vma, addr, page, pte, false, false); + do_set_pte(vma, addr, page, pte, false, false, true); unlock_page(page); goto next; unlock: diff --git a/mm/memory.c b/mm/memory.c index 007c72a..f29e5ab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2876,7 +2876,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, * vm_ops->map_pages. */ void do_set_pte(struct vm_area_struct *vma, unsigned long address, - struct page *page, pte_t *pte, bool write, bool anon) + struct page *page, pte_t *pte, bool write, bool anon, bool old) { pte_t entry; @@ -2884,6 +2884,8 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address, entry = mk_pte(page, vma->vm_page_prot); if (write) entry = maybe_mkwrite(pte_mkdirty(entry), vma); + if (old) + entry = pte_mkold(entry); if (anon) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, address, false); @@ -3021,9 +3023,20 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, */ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) { pte = pte_offset_map_lock(mm, pmd, address, &ptl); - do_fault_around(vma, address, pte, pgoff, flags); if (!pte_same(*pte, orig_pte)) goto unlock_out; + do_fault_around(vma, address, pte, pgoff, flags); + /* Check if the fault is handled by faultaround */ + if (!pte_same(*pte, orig_pte)) { + /* + * Faultaround produces old ptes, but the pte we've + * handled the fault for should be young.
+ */ + pte_t entry = pte_mkyoung(*pte); + if (ptep_set_access_flags(vma, address, pte, entry, 0)) + update_mmu_cache(vma, address, pte); + goto unlock_out; + } pte_unmap_unlock(pte, ptl); } @@ -3038,7 +3051,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma, put_page(fault_page); return ret; } - do_set_pte(vma, address, fault_page, pte, false, false); + do_set_pte(vma, address, fault_page, pte, false, false, false); unlock_page(fault_page); unlock_out: pte_unmap_unlock(pte, ptl); @@ -3090,7 +3103,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma, } goto uncharge_out; } - do_set_pte(vma, address, new_page, pte, true, true); + do_set_pte(vma, address, new_page, pte, true, true, false); mem_cgroup_commit_charge(new_page, memcg, false, false); lru_cache_add_active_or_unevictable(new_page, vma); pte_unmap_unlock(pte, ptl); @@ -3147,7 +3160,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma, put_page(fault_page); return ret; } - do_set_pte(vma, address, fault_page, pte, true, false); + do_set_pte(vma, address, fault_page, pte, true, false, false); pte_unmap_unlock(pte, ptl); if (set_page_dirty(fault_page)) -- cgit v0.10.2 From d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 20 May 2016 16:58:44 -0700 Subject: mm: disable fault around on emulated access bit architecture fault_around aims to reduce minor faults of file-backed pages via speculative ahead pte mapping and relying on readahead logic. However, on architectures without a HW access bit the benefit is highly limited, because they have to emulate the young bit with minor faults for reclaim's page aging algorithm. IOW, we cannot reduce minor faults on those architectures. I did a quick test on my ARM machine. 512M file mmap sequential every-word read on an eSATA drive, 4 times. stddev is stable. = fault_around 4096 = elapsed time(usec): 6747645 = fault_around 65536 = elapsed time(usec): 6709263 0.5% gain. Even when I tested it with eMMC there was no gain, because I guess with slow storage the major fault is the dominant factor. Also, fault_around has the side effect of shrinking slab more aggressively and causes higher vmpressure, so if such speculation fails, it can evict more slab objects, which can result in page I/O (e.g., for the inode cache). In the end, it would void any benefit of fault_around. So let's make the default "disabled" on those architectures. Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox Signed-off-by: Minchan Kim Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/memory.c b/mm/memory.c index f29e5ab..a1b93d9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2899,8 +2899,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address, update_mmu_cache(vma, address, pte); } +/* + * If the architecture emulates the "accessed" or "young" bit without HW + * support, there is not much gain with fault_around. + */ static unsigned long fault_around_bytes __read_mostly = +#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS + PAGE_SIZE; +#else rounddown_pow_of_two(65536); +#endif #ifdef CONFIG_DEBUG_FS static int fault_around_bytes_get(void *data, u64 *val) -- cgit v0.10.2 From 29b52de182acf50f85a8284ad39104d84c9bbf57 Mon Sep 17 00:00:00 2001 From: "seokhoon.yoon" Date: Fri, 20 May 2016 16:58:47 -0700 Subject: mm, kasan: fix to call kasan_free_pages() after poisoning page When CONFIG_PAGE_POISONING and CONFIG_KASAN are enabled, free_pages_prepare()'s code flow is as below.
1) kmemcheck_free_shadow() 2) kasan_free_pages() - marks the page's shadow bytes as freed 3) kernel_poison_pages() 3.1) checks whether access to the page is valid using KASAN ---> an error occurs; KASAN thinks it is an invalid access 3.2) poisons the page 4) kernel_map_pages() So kasan_free_pages() should be called after poisoning the page. Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com Signed-off-by: seokhoon.yoon Cc: Andrey Ryabinin Cc: Laura Abbott Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2dd1ba4..383b14b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -993,7 +993,6 @@ static __always_inline bool free_pages_prepare(struct page *page, trace_mm_page_free(page, order); kmemcheck_free_shadow(page, order); - kasan_free_pages(page, order); /* * Check tail pages before head page information is cleared to @@ -1035,6 +1034,7 @@ static __always_inline bool free_pages_prepare(struct page *page, arch_free_page(page, order); kernel_poison_pages(page, 1 << order, 0); kernel_map_pages(page, 1 << order, 0); + kasan_free_pages(page, order); return true; } -- cgit v0.10.2 From e570f56cccd215db68e50870ee74b7d9c0022109 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Fri, 20 May 2016 16:58:50 -0700 Subject: mm: check_new_page_bad() directly returns in __PG_HWPOISON case Currently we check page->flags twice for the "HWPoisoned" case of check_new_page_bad(), which can cause a race with unpoisoning. This race unnecessarily taints the kernel with "BUG: Bad page state". check_new_page_bad() is the only caller of bad_page() which is interested in __PG_HWPOISON, so let's move the hwpoison-related code from bad_page() into it. Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi Acked-by: Mel Gorman Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 383b14b..f8f3bfc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -522,12 +522,6 @@ static void bad_page(struct page *page, const char *reason, static unsigned long nr_shown; static unsigned long nr_unshown; - /* Don't complain about poisoned pages */ - if (PageHWPoison(page)) { - page_mapcount_reset(page); /* remove PageBuddy */ - return; - } - /* * Allow a burst of 60 reports, then keep quiet for that minute; * or allow a steady drip of one report per second. @@ -1654,6 +1648,9 @@ static void check_new_page_bad(struct page *page) if (unlikely(page->flags & __PG_HWPOISON)) { bad_reason = "HWPoisoned (hardware-corrupted)"; bad_flags = __PG_HWPOISON; + /* Don't complain about hwpoisoned pages */ + page_mapcount_reset(page); /* remove PageBuddy */ + return; } if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set"; -- cgit v0.10.2 From a53eaff8c1192bb5bdfda5deb484bc8f415c5dfd Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 20 May 2016 16:58:53 -0700 Subject: MM: increase safety margin provided by PF_LESS_THROTTLE When nfsd is exporting a filesystem over NFS which is then NFS-mounted on the local machine there is a risk of deadlock. This happens when there are lots of dirty pages in the NFS filesystem and they cause NFSD to be throttled, either in throttle_vm_writeout() or in balance_dirty_pages(). To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads and it provides a 25% increase to the limits that affect NFSD.
Any process writing to an NFS filesystem will be throttled well before the number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD will not deadlock on pages that it needs to write out. At least it shouldn't. All processes are allowed a small excess margin to avoid performing too many calculations: ratelimit_pages. ratelimit_pages is set so that if a thread on every CPU uses the entire margin, the total will only go 3% over the limit, and this is much less than the 25% bonus that PF_LESS_THROTTLE provides, so this margin shouldn't be a problem. But it is. The "total memory" that these 3% and 25% are calculated against is not really total memory but "global_dirtyable_memory()", which doesn't include anonymous memory, just free memory and page-cache memory. The "ratelimit_pages" number is based on whatever global_dirtyable_memory() was at the last CPU hot-plug, which might not be what you expect, but is probably close to the total freeable memory. The throttle threshold uses global_dirtyable_memory() at the moment when the throttling happens, which could be much less than at the last CPU hotplug. So if lots of anonymous memory has been allocated, thus pushing out lots of page-cache pages, then NFSD might end up being throttled due to dirty NFS pages because the "25%" bonus it gets is calculated against a rather small amount of dirtyable memory, while the "3%" margin that other processes are allowed to dirty without penalty is calculated against a much larger number. To remove this possibility of deadlock we need to make sure that the margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin. Simply adding ratelimit_pages isn't enough as that should be multiplied by the number of cpus. So add "global_wb_domain.dirty_limit / 32" as that more accurately reflects the current total over-shoot margin. This ensures that the number of dirty NFS pages never gets so high that nfsd will be throttled waiting for them to be written. Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3b88795..b9956fd 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -411,8 +411,8 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc) bg_thresh = thresh / 2; tsk = current; if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { - bg_thresh += bg_thresh / 4; - thresh += thresh / 4; + bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; + thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } dtc->thresh = thresh; dtc->bg_thresh = bg_thresh; -- cgit v0.10.2 From f0508977787b3bcfd54e82111ab50d6245b9f4df Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Fri, 20 May 2016 16:58:56 -0700 Subject: mm, thp: khugepaged should scan when sleep value is written If a large value is written to scan_sleep_millisecs, for example, that period must lapse before khugepaged will wake up for periodic collapsing. If this value is tuned to 1 day, for example, and then re-tuned to its default 10s, khugepaged will still wait for a day before scanning again. This patch causes khugepaged to wake up immediately when the value is changed and then sleep until that value is rewritten or the new value lapses. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com Signed-off-by: David Rientjes Cc: "Kirill A.
Shutemov" Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1764184..41ef754 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -89,6 +89,7 @@ static unsigned int khugepaged_full_scans; static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000; /* during fragmentation poll the hugepage allocator once every minute */ static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000; +static unsigned long khugepaged_sleep_expire; static struct task_struct *khugepaged_thread __read_mostly; static DEFINE_MUTEX(khugepaged_mutex); static DEFINE_SPINLOCK(khugepaged_mm_lock); @@ -467,6 +468,7 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj, return -EINVAL; khugepaged_scan_sleep_millisecs = msecs; + khugepaged_sleep_expire = 0; wake_up_interruptible(&khugepaged_wait); return count; @@ -494,6 +496,7 @@ static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj, return -EINVAL; khugepaged_alloc_sleep_millisecs = msecs; + khugepaged_sleep_expire = 0; wake_up_interruptible(&khugepaged_wait); return count; @@ -2791,15 +2794,25 @@ static void khugepaged_do_scan(void) put_page(hpage); } +static bool khugepaged_should_wakeup(void) +{ + return kthread_should_stop() || + time_after_eq(jiffies, khugepaged_sleep_expire); +} + static void khugepaged_wait_work(void) { if (khugepaged_has_work()) { - if (!khugepaged_scan_sleep_millisecs) + const unsigned long scan_sleep_jiffies = + msecs_to_jiffies(khugepaged_scan_sleep_millisecs); + + if (!scan_sleep_jiffies) return; + khugepaged_sleep_expire = jiffies + scan_sleep_jiffies; wait_event_freezable_timeout(khugepaged_wait, - kthread_should_stop(), - msecs_to_jiffies(khugepaged_scan_sleep_millisecs)); + khugepaged_should_wakeup(), + scan_sleep_jiffies); return; } -- cgit v0.10.2 From 0bb2fd13b69abfd88880f356903b5c7ca36d5eea Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Fri, 20 May 2016 16:58:59 -0700 Subject: mm: page_is_guard(): return false when page_ext arrays are not allocated yet When enabling the below kernel configs: CONFIG_DEFERRED_STRUCT_PAGE_INIT CONFIG_DEBUG_PAGEALLOC CONFIG_PAGE_EXTENSION CONFIG_DEBUG_VM kernel bootup may fail due to the following oops: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] free_pcppages_bulk+0x2d2/0x8d0 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000 RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0 RSP: 0000:ffff88017c087c48 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401 RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009 R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400 R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080 FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0 Call Trace: free_hot_cold_page+0x192/0x1d0 __free_pages+0x5c/0x90 __free_pages_boot_core+0x11a/0x14e deferred_free_range+0x50/0x62 deferred_init_memmap+0x220/0x3c3 kthread+0xf8/0x110 ret_from_fork+0x22/0x40 Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 
f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff RIP [] free_pcppages_bulk+0x2d2/0x8d0 RSP CR2: 0000000000000000 The problem is that lookup_page_ext() returns NULL and page_is_guard() then tries to dereference it during page freeing. page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page extension flags, but a page being freed might reach here before the page_ext arrays are allocated, when feeding a range of pages to the allocator for the first time during bootup or memory hotplug. When lookup_page_ext() returns NULL, page_is_guard() should just return false instead of checking PAGE_EXT_DEBUG_GUARD unconditionally. Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi Cc: Joonsoo Kim Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/mm.h b/include/linux/mm.h index f223ac2..b530c99 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2386,6 +2386,9 @@ static inline bool page_is_guard(struct page *page) return false; page_ext = lookup_page_ext(page); + if (unlikely(!page_ext)) + return false; + return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); } #else -- cgit v0.10.2 From 6cd9dc3e75078ef646076fa63adfb9b85ced0b66 Mon Sep 17 00:00:00 2001 From: Chen Feng Date: Fri, 20 May 2016 16:59:02 -0700 Subject: mm/compaction.c: fix zoneindex in kcompactd() While testing kcompactd on my platform (3G of memory, DMA zone only), I found that kcompactd never wakes up. The zone index has already had 1 subtracted before this point, so the traversal here should use <=. This fixes a regression where kswapd could previously compact but kcompactd could not. Not a crash fix, though. [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh] Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com Fixes: accf62422b3a ("mm, kswapd: replace kswapd compaction with waking up kcompactd") Signed-off-by: Chen Feng Acked-by: Vlastimil Babka Cc: Hugh Dickins Cc: Michal Hocko Cc: Kirill A. Shutemov Cc: Johannes Weiner Cc: Tejun Heo Cc: Zhuangluan Su Cc: Yiping Xu Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/compaction.c b/mm/compaction.c index d8a20fc..1427366 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1862,7 +1862,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) struct zone *zone; enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx; - for (zoneid = 0; zoneid < classzone_idx; zoneid++) { + for (zoneid = 0; zoneid <= classzone_idx; zoneid++) { zone = &pgdat->node_zones[zoneid]; if (!populated_zone(zone)) @@ -1897,7 +1897,7 @@ static void kcompactd_do_work(pg_data_t *pgdat) cc.classzone_idx); count_vm_event(KCOMPACTD_WAKE); - for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) { + for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) { int status; zone = &pgdat->node_zones[zoneid]; -- cgit v0.10.2 From dfef2ef4027b1304149a65dc33794eab65e8a3bf Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Fri, 20 May 2016 16:59:05 -0700 Subject: mm, migrate: increment fail count on ENOMEM If page migration fails due to -ENOMEM, nr_failed should still be incremented for proper statistics. This was encountered recently when all page migration vmstats showed 0, and it was inferred that migrate_pages() was never called, although in reality the first page migration failed because compaction_alloc() failed to find a migration target.
This patch increments nr_failed so the vmstat is properly accounted on ENOMEM. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com Signed-off-by: David Rientjes Cc: Vlastimil Babka Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/migrate.c b/mm/migrate.c index 53ab639..9baf41c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1171,6 +1171,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, switch(rc) { case -ENOMEM: + nr_failed++; goto out; case -EAGAIN: retry++; -- cgit v0.10.2 From b8f1a75d61d8405a753380c6fb17ba84a5603cd4 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Fri, 20 May 2016 16:59:08 -0700 Subject: mm: call page_ext_init() after all struct pages are initialized When DEFERRED_STRUCT_PAGE_INIT is enabled, just a subset of memmap at boot are initialized, then the rest are initialized in parallel by starting one-off "pgdatinitX" kernel thread for each node X. If page_ext_init is called before it, some pages will not have valid extension, this may lead the below kernel oops when booting up kernel: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [] free_pcppages_bulk+0x2d2/0x8d0 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000 RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0 RSP: 0000:ffff88017c087c48 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401 RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009 R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400 R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080 FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0 Call Trace: free_hot_cold_page+0x192/0x1d0 __free_pages+0x5c/0x90 __free_pages_boot_core+0x11a/0x14e deferred_free_range+0x50/0x62 deferred_init_memmap+0x220/0x3c3 kthread+0xf8/0x110 ret_from_fork+0x22/0x40 Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff RIP [] free_pcppages_bulk+0x2d2/0x8d0 RSP CR2: 0000000000000000 Move page_ext_init() after page_alloc_init_late() to make sure page extension is setup for all pages. 
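Note that until deferred initialization has completed, any page_ext consumer still has to tolerate a NULL lookup, as the page_is_guard() fix earlier in this series does. A sketch of that defensive pattern (mirroring the earlier diff, shown here only for illustration):

	struct page_ext *page_ext = lookup_page_ext(page);

	if (unlikely(!page_ext))
		return false;	/* page_ext arrays not allocated yet */
	return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);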
Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/init/main.c b/init/main.c index b3c6e36..2075faf 100644 --- a/init/main.c +++ b/init/main.c @@ -606,7 +606,6 @@ asmlinkage __visible void __init start_kernel(void) initrd_start = 0; } #endif - page_ext_init(); debug_objects_mem_init(); kmemleak_init(); setup_per_cpu_pageset(); @@ -1004,6 +1003,8 @@ static noinline void __init kernel_init_freeable(void) sched_init_smp(); page_alloc_init_late(); + /* Initialize page ext after all struct pages are initialized */ + page_ext_init(); do_basic_setup(); -- cgit v0.10.2 From 55834c59098d0c5a97b0f3247e55832b67facdcf Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Fri, 20 May 2016 16:59:11 -0700 Subject: mm: kasan: initial memory quarantine implementation Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. When the object is freed, its state changes from KASAN_STATE_ALLOC to KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine instead of being returned to the allocator, therefore every subsequent access to that object triggers a KASAN error, and the error handler is able to say where the object has been allocated and deallocated. When it's time for the object to leave quarantine, its state becomes KASAN_STATE_FREE and it's returned to the allocator. From now on the allocator may reuse it for another allocation. Before that happens, it's still possible to detect a use-after-free on that object (it retains the allocation/deallocation stacks). When the allocator reuses this object, the shadow is unpoisoned and old allocation/deallocation stacks are wiped. Therefore a use of this object, even an incorrect one, won't trigger a KASAN warning. Without the quarantine, it's not guaranteed that the objects aren't reused immediately, that's why the probability of catching a use-after-free is lower than with quarantine in place. Freed objects are first added to per-cpu quarantine queues. When a cache is destroyed or memory shrinking is requested, the objects are moved into the global quarantine queue. Whenever a kmalloc call allows memory reclaiming, the oldest objects are popped out of the global queue until the total size of objects in quarantine is less than 3/4 of the maximum quarantine size (which is a fraction of installed physical memory). As long as an object remains in the quarantine, KASAN is able to report accesses to it, so the chance of reporting a use-after-free is increased. Once the object leaves quarantine, the allocator may reuse it, in which case the object is unpoisoned and KASAN can't detect incorrect accesses to it. Right now quarantine support is only enabled for the SLAB allocator. Unification of KASAN features in SLAB and SLUB will be done later. This patch is based on the "mm: kasan: quarantine" patch originally prepared by Dmitry Chernenkov. A number of improvements have been suggested by Andrey Ryabinin.
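As a rough illustration (editor's sketch, not taken from the patch), a buggy sequence such as

	char *p = kmem_cache_alloc(cache, GFP_KERNEL);

	kmem_cache_free(cache, p);
	p[0] = 'x';	/* use-after-free */

now reliably produces a KASAN report with both the allocation and the deallocation stack, because the freed object sits poisoned in quarantine instead of being handed out again immediately.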
[glider@google.com: v9] Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Andrey Ryabinin Cc: Steven Rostedt Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 737371b..611927f 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -50,6 +50,8 @@ void kasan_free_pages(struct page *page, unsigned int order); void kasan_cache_create(struct kmem_cache *cache, size_t *size, unsigned long *flags); +void kasan_cache_shrink(struct kmem_cache *cache); +void kasan_cache_destroy(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); @@ -63,7 +65,8 @@ void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, void kasan_krealloc(const void *object, size_t new_size, gfp_t flags); void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); -void kasan_slab_free(struct kmem_cache *s, void *object); +bool kasan_slab_free(struct kmem_cache *s, void *object); +void kasan_poison_slab_free(struct kmem_cache *s, void *object); struct kasan_cache { int alloc_meta_offset; @@ -88,6 +91,8 @@ static inline void kasan_free_pages(struct page *page, unsigned int order) {} static inline void kasan_cache_create(struct kmem_cache *cache, size_t *size, unsigned long *flags) {} +static inline void kasan_cache_shrink(struct kmem_cache *cache) {} +static inline void kasan_cache_destroy(struct kmem_cache *cache) {} static inline void kasan_poison_slab(struct page *page) {} static inline void kasan_unpoison_object_data(struct kmem_cache *cache, @@ -105,7 +110,11 @@ static inline void kasan_krealloc(const void *object, size_t new_size, static inline void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags) {} -static inline void kasan_slab_free(struct kmem_cache *s, void *object) {} +static inline bool kasan_slab_free(struct kmem_cache *s, void *object) +{ + return false; +} +static inline void kasan_poison_slab_free(struct kmem_cache *s, void *object) {} static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } static inline void kasan_free_shadow(const struct vm_struct *vm) {} diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile index 131daad..1548749 100644 --- a/mm/kasan/Makefile +++ b/mm/kasan/Makefile @@ -8,3 +8,4 @@ CFLAGS_REMOVE_kasan.o = -pg CFLAGS_kasan.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) obj-y := kasan.o report.o kasan_init.o +obj-$(CONFIG_SLAB) += quarantine.o diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 38f1dd7..8df666b 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -388,6 +388,16 @@ void kasan_cache_create(struct kmem_cache *cache, size_t *size, } #endif +void kasan_cache_shrink(struct kmem_cache *cache) +{ + quarantine_remove_cache(cache); +} + +void kasan_cache_destroy(struct kmem_cache *cache) +{ + quarantine_remove_cache(cache); +} + void kasan_poison_slab(struct page *page) { kasan_poison_shadow(page_address(page), @@ -482,7 +492,7 @@ void kasan_slab_alloc(struct kmem_cache *cache, void *object, gfp_t flags) kasan_kmalloc(cache, object, cache->object_size, flags); } -void kasan_slab_free(struct kmem_cache *cache, void *object) +void kasan_poison_slab_free(struct kmem_cache *cache, void *object) { unsigned 
long size = cache->object_size; unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE); @@ -491,18 +501,43 @@ void kasan_slab_free(struct kmem_cache *cache, void *object) if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU)) return; + kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE); +} + +bool kasan_slab_free(struct kmem_cache *cache, void *object) +{ #ifdef CONFIG_SLAB - if (cache->flags & SLAB_KASAN) { - struct kasan_free_meta *free_info = - get_free_info(cache, object); + /* RCU slabs could be legally used after free within the RCU period */ + if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU)) + return false; + + if (likely(cache->flags & SLAB_KASAN)) { struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); - alloc_info->state = KASAN_STATE_FREE; - set_track(&free_info->track, GFP_NOWAIT); + struct kasan_free_meta *free_info = + get_free_info(cache, object); + + switch (alloc_info->state) { + case KASAN_STATE_ALLOC: + alloc_info->state = KASAN_STATE_QUARANTINE; + quarantine_put(free_info, cache); + set_track(&free_info->track, GFP_NOWAIT); + kasan_poison_slab_free(cache, object); + return true; + case KASAN_STATE_QUARANTINE: + case KASAN_STATE_FREE: + pr_err("Double free"); + dump_stack(); + break; + default: + break; + } } + return false; +#else + kasan_poison_slab_free(cache, object); + return false; #endif - - kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE); } void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, @@ -511,6 +546,9 @@ void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, unsigned long redzone_start; unsigned long redzone_end; + if (flags & __GFP_RECLAIM) + quarantine_reduce(); + if (unlikely(object == NULL)) return; @@ -541,6 +579,9 @@ void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags) unsigned long redzone_start; unsigned long redzone_end; + if (flags & __GFP_RECLAIM) + quarantine_reduce(); + if (unlikely(ptr == NULL)) return; diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 30a2f0b..7f7ac51 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -62,6 +62,7 @@ struct kasan_global { enum kasan_state { KASAN_STATE_INIT, KASAN_STATE_ALLOC, + KASAN_STATE_QUARANTINE, KASAN_STATE_FREE }; @@ -79,9 +80,14 @@ struct kasan_alloc_meta { u32 reserved; }; +struct qlist_node { + struct qlist_node *next; +}; struct kasan_free_meta { - /* Allocator freelist pointer, unused by KASAN. */ - void **freelist; + /* This field is used while the object is in the quarantine. + * Otherwise it might be used for the allocator freelist. + */ + struct qlist_node quarantine_link; struct kasan_track track; }; @@ -105,4 +111,15 @@ static inline bool kasan_report_enabled(void) void kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip); +#ifdef CONFIG_SLAB +void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache); +void quarantine_reduce(void); +void quarantine_remove_cache(struct kmem_cache *cache); +#else +static inline void quarantine_put(struct kasan_free_meta *info, + struct kmem_cache *cache) { } +static inline void quarantine_reduce(void) { } +static inline void quarantine_remove_cache(struct kmem_cache *cache) { } +#endif + #endif diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c new file mode 100644 index 0000000..4973505 --- /dev/null +++ b/mm/kasan/quarantine.c @@ -0,0 +1,291 @@ +/* + * KASAN quarantine. + * + * Author: Alexander Potapenko + * Copyright (C) 2016 Google, Inc. 
+ * + * Based on code by Dmitry Chernenkov. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../slab.h" +#include "kasan.h" + +/* Data structure and operations for quarantine queues. */ + +/* + * Each queue is a signle-linked list, which also stores the total size of + * objects inside of it. + */ +struct qlist_head { + struct qlist_node *head; + struct qlist_node *tail; + size_t bytes; +}; + +#define QLIST_INIT { NULL, NULL, 0 } + +static bool qlist_empty(struct qlist_head *q) +{ + return !q->head; +} + +static void qlist_init(struct qlist_head *q) +{ + q->head = q->tail = NULL; + q->bytes = 0; +} + +static void qlist_put(struct qlist_head *q, struct qlist_node *qlink, + size_t size) +{ + if (unlikely(qlist_empty(q))) + q->head = qlink; + else + q->tail->next = qlink; + q->tail = qlink; + qlink->next = NULL; + q->bytes += size; +} + +static void qlist_move_all(struct qlist_head *from, struct qlist_head *to) +{ + if (unlikely(qlist_empty(from))) + return; + + if (qlist_empty(to)) { + *to = *from; + qlist_init(from); + return; + } + + to->tail->next = from->head; + to->tail = from->tail; + to->bytes += from->bytes; + + qlist_init(from); +} + +static void qlist_move(struct qlist_head *from, struct qlist_node *last, + struct qlist_head *to, size_t size) +{ + if (unlikely(last == from->tail)) { + qlist_move_all(from, to); + return; + } + if (qlist_empty(to)) + to->head = from->head; + else + to->tail->next = from->head; + to->tail = last; + from->head = last->next; + last->next = NULL; + from->bytes -= size; + to->bytes += size; +} + + +/* + * The object quarantine consists of per-cpu queues and a global queue, + * guarded by quarantine_lock. + */ +static DEFINE_PER_CPU(struct qlist_head, cpu_quarantine); + +static struct qlist_head global_quarantine; +static DEFINE_SPINLOCK(quarantine_lock); + +/* Maximum size of the global queue. */ +static unsigned long quarantine_size; + +/* + * The fraction of physical memory the quarantine is allowed to occupy. + * Quarantine doesn't support memory shrinker with SLAB allocator, so we keep + * the ratio low to avoid OOM. 
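+ *
+ * For example, on a machine with 4GB of RAM, the cap computed from the
+ * defines below works out to 4GB / 32 = 128MB, minus 1MB of per-cpu queue
+ * budget for each online CPU; quarantine_reduce() then drains the global
+ * queue down to 3/4 of that cap.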
+ */ +#define QUARANTINE_FRACTION 32 + +#define QUARANTINE_LOW_SIZE (READ_ONCE(quarantine_size) * 3 / 4) +#define QUARANTINE_PERCPU_SIZE (1 << 20) + +static struct kmem_cache *qlink_to_cache(struct qlist_node *qlink) +{ + return virt_to_head_page(qlink)->slab_cache; +} + +static void *qlink_to_object(struct qlist_node *qlink, struct kmem_cache *cache) +{ + struct kasan_free_meta *free_info = + container_of(qlink, struct kasan_free_meta, + quarantine_link); + + return ((void *)free_info) - cache->kasan_info.free_meta_offset; +} + +static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache) +{ + void *object = qlink_to_object(qlink, cache); + struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); + unsigned long flags; + + local_irq_save(flags); + alloc_info->state = KASAN_STATE_FREE; + ___cache_free(cache, object, _THIS_IP_); + local_irq_restore(flags); +} + +static void qlist_free_all(struct qlist_head *q, struct kmem_cache *cache) +{ + struct qlist_node *qlink; + + if (unlikely(qlist_empty(q))) + return; + + qlink = q->head; + while (qlink) { + struct kmem_cache *obj_cache = + cache ? cache : qlink_to_cache(qlink); + struct qlist_node *next = qlink->next; + + qlink_free(qlink, obj_cache); + qlink = next; + } + qlist_init(q); +} + +void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache) +{ + unsigned long flags; + struct qlist_head *q; + struct qlist_head temp = QLIST_INIT; + + local_irq_save(flags); + + q = this_cpu_ptr(&cpu_quarantine); + qlist_put(q, &info->quarantine_link, cache->size); + if (unlikely(q->bytes > QUARANTINE_PERCPU_SIZE)) + qlist_move_all(q, &temp); + + local_irq_restore(flags); + + if (unlikely(!qlist_empty(&temp))) { + spin_lock_irqsave(&quarantine_lock, flags); + qlist_move_all(&temp, &global_quarantine); + spin_unlock_irqrestore(&quarantine_lock, flags); + } +} + +void quarantine_reduce(void) +{ + size_t new_quarantine_size; + unsigned long flags; + struct qlist_head to_free = QLIST_INIT; + size_t size_to_free = 0; + struct qlist_node *last; + + if (likely(READ_ONCE(global_quarantine.bytes) <= + READ_ONCE(quarantine_size))) + return; + + spin_lock_irqsave(&quarantine_lock, flags); + + /* + * Update quarantine size in case of hotplug. Allocate a fraction of + * the installed memory to quarantine minus per-cpu queue limits. 
+ */ + new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) / + QUARANTINE_FRACTION; + new_quarantine_size -= QUARANTINE_PERCPU_SIZE * num_online_cpus(); + WRITE_ONCE(quarantine_size, new_quarantine_size); + + last = global_quarantine.head; + while (last) { + struct kmem_cache *cache = qlink_to_cache(last); + + size_to_free += cache->size; + if (!last->next || size_to_free > + global_quarantine.bytes - QUARANTINE_LOW_SIZE) + break; + last = last->next; + } + qlist_move(&global_quarantine, last, &to_free, size_to_free); + + spin_unlock_irqrestore(&quarantine_lock, flags); + + qlist_free_all(&to_free, NULL); +} + +static void qlist_move_cache(struct qlist_head *from, + struct qlist_head *to, + struct kmem_cache *cache) +{ + struct qlist_node *prev = NULL, *curr; + + if (unlikely(qlist_empty(from))) + return; + + curr = from->head; + while (curr) { + struct qlist_node *qlink = curr; + struct kmem_cache *obj_cache = qlink_to_cache(qlink); + + if (obj_cache == cache) { + if (unlikely(from->head == qlink)) { + from->head = curr->next; + prev = curr; + } else + prev->next = curr->next; + if (unlikely(from->tail == qlink)) + from->tail = curr->next; + from->bytes -= cache->size; + qlist_put(to, qlink, cache->size); + } else { + prev = curr; + } + curr = curr->next; + } +} + +static void per_cpu_remove_cache(void *arg) +{ + struct kmem_cache *cache = arg; + struct qlist_head to_free = QLIST_INIT; + struct qlist_head *q; + + q = this_cpu_ptr(&cpu_quarantine); + qlist_move_cache(q, &to_free, cache); + qlist_free_all(&to_free, cache); +} + +void quarantine_remove_cache(struct kmem_cache *cache) +{ + unsigned long flags; + struct qlist_head to_free = QLIST_INIT; + + on_each_cpu(per_cpu_remove_cache, cache, 1); + + spin_lock_irqsave(&quarantine_lock, flags); + qlist_move_cache(&global_quarantine, &to_free, cache); + spin_unlock_irqrestore(&quarantine_lock, flags); + + qlist_free_all(&to_free, cache); +} diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 60869a5..b3c122d 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -151,6 +151,7 @@ static void object_err(struct kmem_cache *cache, struct page *page, print_track(&alloc_info->track); break; case KASAN_STATE_FREE: + case KASAN_STATE_QUARANTINE: pr_err("Object freed, allocated with size %u bytes\n", alloc_info->alloc_size); free_info = get_free_info(cache, object); diff --git a/mm/mempool.c b/mm/mempool.c index 9b7a14a..9e075f8 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -105,7 +105,7 @@ static inline void poison_element(mempool_t *pool, void *element) static void kasan_poison_element(mempool_t *pool, void *element) { if (pool->alloc == mempool_alloc_slab) - kasan_slab_free(pool->pool_data, element); + kasan_poison_slab_free(pool->pool_data, element); if (pool->alloc == mempool_kmalloc) kasan_kfree(element); if (pool->alloc == mempool_alloc_pages) diff --git a/mm/slab.c b/mm/slab.c index c11bf50..28864c0 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3547,9 +3547,17 @@ free_done: static inline void __cache_free(struct kmem_cache *cachep, void *objp, unsigned long caller) { - struct array_cache *ac = cpu_cache_get(cachep); + /* Put the object into the quarantine, don't touch it for now. 
*/ + if (kasan_slab_free(cachep, objp)) + return; + + ___cache_free(cachep, objp, caller); +} - kasan_slab_free(cachep, objp); +void ___cache_free(struct kmem_cache *cachep, void *objp, + unsigned long caller) +{ + struct array_cache *ac = cpu_cache_get(cachep); check_irq_off(); kmemleak_free_recursive(objp, cachep->flags); diff --git a/mm/slab.h b/mm/slab.h index 5969769..dedb1a9 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -462,4 +462,6 @@ void *slab_next(struct seq_file *m, void *p, loff_t *pos); void slab_stop(struct seq_file *m, void *p); int memcg_slab_show(struct seq_file *m, void *p); +void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr); + #endif /* MM_SLAB_H */ diff --git a/mm/slab_common.c b/mm/slab_common.c index 3239bfd..a65dad7 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -715,6 +715,7 @@ void kmem_cache_destroy(struct kmem_cache *s) get_online_cpus(); get_online_mems(); + kasan_cache_destroy(s); mutex_lock(&slab_mutex); s->refcount--; @@ -753,6 +754,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); + kasan_cache_shrink(cachep); ret = __kmem_cache_shrink(cachep, false); put_online_mems(); put_online_cpus(); -- cgit v0.10.2 From 4ebb31a42ffa03912447fe1aabbdb28242f909ba Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Fri, 20 May 2016 16:59:14 -0700 Subject: mm, kasan: don't call kasan_krealloc() from ksize(). Instead of calling kasan_krealloc(), which replaces the memory allocation stack ID (if stack depot is used), just unpoison the whole memory chunk. Signed-off-by: Alexander Potapenko Acked-by: Andrey Ryabinin Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Christoph Lameter Cc: Konstantin Serebryany Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/slab.c b/mm/slab.c index 28864c0..cc8bbc1 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -4501,7 +4501,7 @@ size_t ksize(const void *objp) /* We assume that ksize callers could use the whole allocated area, * so we need to unpoison this area. */ - kasan_krealloc(objp, size, GFP_NOWAIT); + kasan_unpoison_shadow(objp, size); return size; } diff --git a/mm/slub.c b/mm/slub.c index cf1faa4..825ff45 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3635,8 +3635,9 @@ size_t ksize(const void *object) { size_t size = __ksize(object); /* We assume that ksize callers could use whole allocated area, - so we need unpoison this area. */ - kasan_krealloc(object, size, GFP_NOWAIT); + * so we need to unpoison this area. + */ + kasan_unpoison_shadow(object, size); return size; } EXPORT_SYMBOL(ksize); -- cgit v0.10.2 From 96fe805fb6fe9b2ed12fc54ad0e3e6829a4152cb Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Fri, 20 May 2016 16:59:17 -0700 Subject: mm, kasan: add a ksize() test Add a test that makes sure ksize() unpoisons the whole chunk. 
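For illustration only (kernel context assumed, function name hypothetical, not part of the patch), the caller-visible contract being tested looks roughly like this: ksize() may return more than the requested size because the slab allocator rounds allocations up, and every byte up to that returned size must be usable without a KASAN report:

static void ksize_contract_example(void)
{
	char *p = kmalloc(100, GFP_KERNEL);
	size_t real;

	if (!p)
		return;
	real = ksize(p);	/* >= 100, rounded up by the slab allocator */
	memset(p, 0, real);	/* fine: ksize() unpoisons the whole chunk */
	p[real] = 'x';		/* first byte past the usable area: KASAN should report this */
	kfree(p);
}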
Signed-off-by: Alexander Potapenko Acked-by: Andrey Ryabinin Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Christoph Lameter Cc: Konstantin Serebryany Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/test_kasan.c b/lib/test_kasan.c index 82169fb..48e5a0b 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -344,6 +344,25 @@ static noinline void __init kasan_stack_oob(void) *(volatile char *)p; } +static noinline void __init ksize_unpoisons_memory(void) +{ + char *ptr; + size_t size = 123, real_size = size; + + pr_info("ksize() unpoisons the whole allocated chunk\n"); + ptr = kmalloc(size, GFP_KERNEL); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + } + real_size = ksize(ptr); + /* This access doesn't trigger an error. */ + ptr[size] = 'x'; + /* This one does. */ + ptr[real_size] = 'y'; + kfree(ptr); +} + static int __init kmalloc_tests_init(void) { kmalloc_oob_right(); @@ -367,6 +386,7 @@ static int __init kmalloc_tests_init(void) kmem_cache_oob(); kasan_stack_oob(); kasan_global_oob(); + ksize_unpoisons_memory(); return -EAGAIN; } -- cgit v0.10.2 From 936bb4bbbb832f81055328b84e5afe1fc7246a8d Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Fri, 20 May 2016 16:59:20 -0700 Subject: mm/kasan: print name of mem[set,cpy,move]() caller in report When bogus memory access happens in mem[set,cpy,move]() it's usually caller's fault. So don't blame mem[set,cpy,move]() in bug report, blame the caller instead. Before: BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at
After: BUG: KASAN: out-of-bounds access in <memset caller> at <address>
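As a hedged illustration of the effect (hypothetical caller, not taken from the patch): an out-of-bounds memset() in a driver is now attributed to the function that called memset() rather than to memset() itself:

static void example_driver_init(void)
{
	char buf[8];

	/*
	 * Writes one byte past the buffer. Before this patch the report
	 * header blamed memset(); with it, the report names
	 * example_driver_init(), which is where the bug actually lives.
	 */
	memset(buf, 0, sizeof(buf) + 1);
}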
Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Acked-by: Alexander Potapenko Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 8df666b..e5beb40 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -273,32 +273,36 @@ static __always_inline bool memory_is_poisoned(unsigned long addr, size_t size) return memory_is_poisoned_n(addr, size); } - -static __always_inline void check_memory_region(unsigned long addr, - size_t size, bool write) +static __always_inline void check_memory_region_inline(unsigned long addr, + size_t size, bool write, + unsigned long ret_ip) { if (unlikely(size == 0)) return; if (unlikely((void *)addr < kasan_shadow_to_mem((void *)KASAN_SHADOW_START))) { - kasan_report(addr, size, write, _RET_IP_); + kasan_report(addr, size, write, ret_ip); return; } if (likely(!memory_is_poisoned(addr, size))) return; - kasan_report(addr, size, write, _RET_IP_); + kasan_report(addr, size, write, ret_ip); } -void __asan_loadN(unsigned long addr, size_t size); -void __asan_storeN(unsigned long addr, size_t size); +static void check_memory_region(unsigned long addr, + size_t size, bool write, + unsigned long ret_ip) +{ + check_memory_region_inline(addr, size, write, ret_ip); +} #undef memset void *memset(void *addr, int c, size_t len) { - __asan_storeN((unsigned long)addr, len); + check_memory_region((unsigned long)addr, len, true, _RET_IP_); return __memset(addr, c, len); } @@ -306,8 +310,8 @@ void *memset(void *addr, int c, size_t len) #undef memmove void *memmove(void *dest, const void *src, size_t len) { - __asan_loadN((unsigned long)src, len); - __asan_storeN((unsigned long)dest, len); + check_memory_region((unsigned long)src, len, false, _RET_IP_); + check_memory_region((unsigned long)dest, len, true, _RET_IP_); return __memmove(dest, src, len); } @@ -315,8 +319,8 @@ void *memmove(void *dest, const void *src, size_t len) #undef memcpy void *memcpy(void *dest, const void *src, size_t len) { - __asan_loadN((unsigned long)src, len); - __asan_storeN((unsigned long)dest, len); + check_memory_region((unsigned long)src, len, false, _RET_IP_); + check_memory_region((unsigned long)dest, len, true, _RET_IP_); return __memcpy(dest, src, len); } @@ -690,22 +694,22 @@ void __asan_unregister_globals(struct kasan_global *globals, size_t size) } EXPORT_SYMBOL(__asan_unregister_globals); -#define DEFINE_ASAN_LOAD_STORE(size) \ - void __asan_load##size(unsigned long addr) \ - { \ - check_memory_region(addr, size, false); \ - } \ - EXPORT_SYMBOL(__asan_load##size); \ - __alias(__asan_load##size) \ - void __asan_load##size##_noabort(unsigned long); \ - EXPORT_SYMBOL(__asan_load##size##_noabort); \ - void __asan_store##size(unsigned long addr) \ - { \ - check_memory_region(addr, size, true); \ - } \ - EXPORT_SYMBOL(__asan_store##size); \ - __alias(__asan_store##size) \ - void __asan_store##size##_noabort(unsigned long); \ +#define DEFINE_ASAN_LOAD_STORE(size) \ + void __asan_load##size(unsigned long addr) \ + { \ + check_memory_region_inline(addr, size, false, _RET_IP_);\ + } \ + EXPORT_SYMBOL(__asan_load##size); \ + __alias(__asan_load##size) \ + void __asan_load##size##_noabort(unsigned long); \ + EXPORT_SYMBOL(__asan_load##size##_noabort); \ + void __asan_store##size(unsigned long addr) \ + { \ + check_memory_region_inline(addr, size, true, _RET_IP_); \ + } \ + 
EXPORT_SYMBOL(__asan_store##size); \ + __alias(__asan_store##size) \ + void __asan_store##size##_noabort(unsigned long); \ EXPORT_SYMBOL(__asan_store##size##_noabort) DEFINE_ASAN_LOAD_STORE(1); @@ -716,7 +720,7 @@ DEFINE_ASAN_LOAD_STORE(16); void __asan_loadN(unsigned long addr, size_t size) { - check_memory_region(addr, size, false); + check_memory_region(addr, size, false, _RET_IP_); } EXPORT_SYMBOL(__asan_loadN); @@ -726,7 +730,7 @@ EXPORT_SYMBOL(__asan_loadN_noabort); void __asan_storeN(unsigned long addr, size_t size) { - check_memory_region(addr, size, true); + check_memory_region(addr, size, true, _RET_IP_); } EXPORT_SYMBOL(__asan_storeN); -- cgit v0.10.2 From 64f8ebaf115bcddc4aaa902f981c57ba6506bc42 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Fri, 20 May 2016 16:59:28 -0700 Subject: mm/kasan: add API to check memory regions Memory access coded in an assembly won't be seen by KASAN as a compiler can instrument only C code. Add kasan_check_[read,write]() API which is going to be used to check a certain memory range. Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Acked-by: Alexander Potapenko Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/MAINTAINERS b/MAINTAINERS index 374ffa2..8b92445 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6242,7 +6242,7 @@ S: Maintained F: arch/*/include/asm/kasan.h F: arch/*/mm/kasan_init* F: Documentation/kasan.txt -F: include/linux/kasan.h +F: include/linux/kasan*.h F: lib/test_kasan.c F: mm/kasan/ F: scripts/Makefile.kasan diff --git a/include/linux/kasan-checks.h b/include/linux/kasan-checks.h new file mode 100644 index 0000000..b7f8ace --- /dev/null +++ b/include/linux/kasan-checks.h @@ -0,0 +1,12 @@ +#ifndef _LINUX_KASAN_CHECKS_H +#define _LINUX_KASAN_CHECKS_H + +#ifdef CONFIG_KASAN +void kasan_check_read(const void *p, unsigned int size); +void kasan_check_write(const void *p, unsigned int size); +#else +static inline void kasan_check_read(const void *p, unsigned int size) { } +static inline void kasan_check_write(const void *p, unsigned int size) { } +#endif + +#endif diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index e5beb40..18b6a2b 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -299,6 +299,18 @@ static void check_memory_region(unsigned long addr, check_memory_region_inline(addr, size, write, ret_ip); } +void kasan_check_read(const void *p, unsigned int size) +{ + check_memory_region((unsigned long)p, size, false, _RET_IP_); +} +EXPORT_SYMBOL(kasan_check_read); + +void kasan_check_write(const void *p, unsigned int size) +{ + check_memory_region((unsigned long)p, size, true, _RET_IP_); +} +EXPORT_SYMBOL(kasan_check_write); + #undef memset void *memset(void *addr, int c, size_t len) { -- cgit v0.10.2 From 1771c6e1a567ea0ba2cccc0a4ffe68a1419fd8ef Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Fri, 20 May 2016 16:59:31 -0700 Subject: x86/kasan: instrument user memory access API Exchange between user and kernel memory is coded in assembly language. Which means that such accesses won't be spotted by KASAN as a compiler instruments only C code. Add explicit KASAN checks to user memory access API to ensure that userspace writes to (or reads from) a valid kernel memory. Note: Unlike others strncpy_from_user() is written mostly in C and KASAN sees memory accesses in it. 
However, it makes sense to add explicit check for all @count bytes that *potentially* could be written to the kernel. [aryabinin@virtuozzo.com: move kasan check under the condition] Link: http://lkml.kernel.org/r/1462869209-21096-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1462538722-1574-4-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 12f9653..2982387 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -5,6 +5,7 @@ */ #include #include +#include #include #include #include @@ -721,6 +722,8 @@ copy_from_user(void *to, const void __user *from, unsigned long n) might_fault(); + kasan_check_write(to, n); + /* * While we would like to have the compiler do the checking for us * even in the non-constant size case, any false positives there are @@ -754,6 +757,8 @@ copy_to_user(void __user *to, const void *from, unsigned long n) { int sz = __compiletime_object_size(from); + kasan_check_read(from, n); + might_fault(); /* See the comment in copy_from_user() above. */ diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index 3076986..2eac2aa 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -109,6 +110,7 @@ static __always_inline __must_check int __copy_from_user(void *dst, const void __user *src, unsigned size) { might_fault(); + kasan_check_write(dst, size); return __copy_from_user_nocheck(dst, src, size); } @@ -175,6 +177,7 @@ static __always_inline __must_check int __copy_to_user(void __user *dst, const void *src, unsigned size) { might_fault(); + kasan_check_read(src, size); return __copy_to_user_nocheck(dst, src, size); } @@ -242,12 +245,14 @@ int __copy_in_user(void __user *dst, const void __user *src, unsigned size) static __must_check __always_inline int __copy_from_user_inatomic(void *dst, const void __user *src, unsigned size) { + kasan_check_write(dst, size); return __copy_from_user_nocheck(dst, src, size); } static __must_check __always_inline int __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) { + kasan_check_read(src, size); return __copy_to_user_nocheck(dst, src, size); } @@ -258,6 +263,7 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src, unsigned size) { might_fault(); + kasan_check_write(dst, size); return __copy_user_nocache(dst, src, size, 1); } @@ -265,6 +271,7 @@ static inline int __copy_from_user_inatomic_nocache(void *dst, const void __user *src, unsigned size) { + kasan_check_write(dst, size); return __copy_user_nocache(dst, src, size, 0); } diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c index 3384032..33f655e 100644 --- a/lib/strncpy_from_user.c +++ b/lib/strncpy_from_user.c @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -109,6 +110,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count) unsigned long max = max_addr - src_addr; long retval; + kasan_check_write(dst, count); user_access_begin(); retval = do_strncpy_from_user(dst, src, count, max); user_access_end(); -- cgit v0.10.2 From eae08dcab80c695c16c9f1f7dcd5b8ed52bfc88b Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Fri, 20 May 2016 
16:59:34 -0700 Subject: kasan/tests: add tests for user memory access functions Add some tests for the newly-added user memory access API. Link: http://lkml.kernel.org/r/1462538722-1574-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/test_kasan.c b/lib/test_kasan.c index 48e5a0b..5e51872b 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -12,9 +12,12 @@ #define pr_fmt(fmt) "kasan test: %s " fmt, __func__ #include +#include +#include #include #include #include +#include #include static noinline void __init kmalloc_oob_right(void) @@ -363,6 +366,51 @@ static noinline void __init ksize_unpoisons_memory(void) kfree(ptr); } +static noinline void __init copy_user_test(void) +{ + char *kmem; + char __user *usermem; + size_t size = 10; + int unused; + + kmem = kmalloc(size, GFP_KERNEL); + if (!kmem) + return; + + usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_ANONYMOUS | MAP_PRIVATE, 0); + if (IS_ERR(usermem)) { + pr_err("Failed to allocate user memory\n"); + kfree(kmem); + return; + } + + pr_info("out-of-bounds in copy_from_user()\n"); + unused = copy_from_user(kmem, usermem, size + 1); + + pr_info("out-of-bounds in copy_to_user()\n"); + unused = copy_to_user(usermem, kmem, size + 1); + + pr_info("out-of-bounds in __copy_from_user()\n"); + unused = __copy_from_user(kmem, usermem, size + 1); + + pr_info("out-of-bounds in __copy_to_user()\n"); + unused = __copy_to_user(usermem, kmem, size + 1); + + pr_info("out-of-bounds in __copy_from_user_inatomic()\n"); + unused = __copy_from_user_inatomic(kmem, usermem, size + 1); + + pr_info("out-of-bounds in __copy_to_user_inatomic()\n"); + unused = __copy_to_user_inatomic(usermem, kmem, size + 1); + + pr_info("out-of-bounds in strncpy_from_user()\n"); + unused = strncpy_from_user(kmem, usermem, size + 1); + + vm_munmap((unsigned long)usermem, PAGE_SIZE); + kfree(kmem); +} + static int __init kmalloc_tests_init(void) { kmalloc_oob_right(); @@ -387,6 +435,7 @@ static int __init kmalloc_tests_init(void) kasan_stack_oob(); kasan_global_oob(); ksize_unpoisons_memory(); + copy_user_test(); return -EAGAIN; } -- cgit v0.10.2 From a42094676f076534bf4998625456fe0bb99c1f1e Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 20 May 2016 16:59:36 -0700 Subject: zsmalloc: use first_page rather than page Clean up function parameter "struct page". Many functions of zsmalloc expect that page paramter is "first_page" so use "first_page" rather than "page" for code readability. 
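A minimal hypothetical sketch of the convention (not from the patch): when a helper is only meaningful for the head page of a zspage, naming the parameter first_page states that contract explicitly, and the existing is_first_page() assertion backs it up:

static int zspage_objects_in_use(struct page *first_page)
{
	/* callers must pass the head page of the zspage */
	BUG_ON(!is_first_page(first_page));
	return first_page->inuse;
}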
Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index fe47fbb..c3e55a4 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -413,26 +413,28 @@ static int is_last_page(struct page *page) return PagePrivate2(page); } -static void get_zspage_mapping(struct page *page, unsigned int *class_idx, +static void get_zspage_mapping(struct page *first_page, + unsigned int *class_idx, enum fullness_group *fullness) { unsigned long m; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); - m = (unsigned long)page->mapping; + m = (unsigned long)first_page->mapping; *fullness = m & FULLNESS_MASK; *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK; } -static void set_zspage_mapping(struct page *page, unsigned int class_idx, +static void set_zspage_mapping(struct page *first_page, + unsigned int class_idx, enum fullness_group fullness) { unsigned long m; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) | (fullness & FULLNESS_MASK); - page->mapping = (struct address_space *)m; + first_page->mapping = (struct address_space *)m; } /* @@ -625,14 +627,14 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool) * the pool (not yet implemented). This function returns fullness * status of the given page. */ -static enum fullness_group get_fullness_group(struct page *page) +static enum fullness_group get_fullness_group(struct page *first_page) { int inuse, max_objects; enum fullness_group fg; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); - inuse = page->inuse; - max_objects = page->objects; + inuse = first_page->inuse; + max_objects = first_page->objects; if (inuse == 0) fg = ZS_EMPTY; @@ -652,12 +654,12 @@ static enum fullness_group get_fullness_group(struct page *page) * have. This functions inserts the given zspage into the freelist * identified by . */ -static void insert_zspage(struct page *page, struct size_class *class, +static void insert_zspage(struct page *first_page, struct size_class *class, enum fullness_group fullness) { struct page **head; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); if (fullness >= _ZS_NR_FULLNESS_GROUPS) return; @@ -667,7 +669,7 @@ static void insert_zspage(struct page *page, struct size_class *class, head = &class->fullness_list[fullness]; if (!*head) { - *head = page; + *head = first_page; return; } @@ -675,21 +677,21 @@ static void insert_zspage(struct page *page, struct size_class *class, * We want to see more ZS_FULL pages and less almost * empty/full. Put pages with higher ->inuse first. */ - list_add_tail(&page->lru, &(*head)->lru); - if (page->inuse >= (*head)->inuse) - *head = page; + list_add_tail(&first_page->lru, &(*head)->lru); + if (first_page->inuse >= (*head)->inuse) + *head = first_page; } /* * This function removes the given zspage from the freelist identified * by . 
*/ -static void remove_zspage(struct page *page, struct size_class *class, +static void remove_zspage(struct page *first_page, struct size_class *class, enum fullness_group fullness) { struct page **head; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); if (fullness >= _ZS_NR_FULLNESS_GROUPS) return; @@ -698,11 +700,11 @@ static void remove_zspage(struct page *page, struct size_class *class, BUG_ON(!*head); if (list_empty(&(*head)->lru)) *head = NULL; - else if (*head == page) + else if (*head == first_page) *head = (struct page *)list_entry((*head)->lru.next, struct page, lru); - list_del_init(&page->lru); + list_del_init(&first_page->lru); zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ? CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1); } @@ -717,21 +719,21 @@ static void remove_zspage(struct page *page, struct size_class *class, * fullness group. */ static enum fullness_group fix_fullness_group(struct size_class *class, - struct page *page) + struct page *first_page) { int class_idx; enum fullness_group currfg, newfg; - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); - get_zspage_mapping(page, &class_idx, &currfg); - newfg = get_fullness_group(page); + get_zspage_mapping(first_page, &class_idx, &currfg); + newfg = get_fullness_group(first_page); if (newfg == currfg) goto out; - remove_zspage(page, class, currfg); - insert_zspage(page, class, newfg); - set_zspage_mapping(page, class_idx, newfg); + remove_zspage(first_page, class, currfg); + insert_zspage(first_page, class, newfg); + set_zspage_mapping(first_page, class_idx, newfg); out: return newfg; @@ -1234,11 +1236,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) return true; } -static bool zspage_full(struct page *page) +static bool zspage_full(struct page *first_page) { - BUG_ON(!is_first_page(page)); + BUG_ON(!is_first_page(first_page)); - return page->inuse == page->objects; + return first_page->inuse == first_page->objects; } unsigned long zs_get_total_pages(struct zs_pool *pool) -- cgit v0.10.2 From 830e4bc5baa9fda5d45257e9a3dbb3555c6c180e Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 20 May 2016 16:59:39 -0700 Subject: zsmalloc: clean up many BUG_ON There are many BUG_ON in zsmalloc.c which is not recommened so change them as alternatives. Normal rule is as follows: 1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE 2. use VM_BUG_ON_PAGE if we need to see struct page's fields 3. use those assertion in primitive functions so higher functions can rely on the assertion in the primitive function. 4. 
Don't use assertion if following instruction can trigger Oops Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index c3e55a4..dfe684c 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -418,7 +418,7 @@ static void get_zspage_mapping(struct page *first_page, enum fullness_group *fullness) { unsigned long m; - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); m = (unsigned long)first_page->mapping; *fullness = m & FULLNESS_MASK; @@ -430,7 +430,7 @@ static void set_zspage_mapping(struct page *first_page, enum fullness_group fullness) { unsigned long m; - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) | (fullness & FULLNESS_MASK); @@ -631,7 +631,8 @@ static enum fullness_group get_fullness_group(struct page *first_page) { int inuse, max_objects; enum fullness_group fg; - BUG_ON(!is_first_page(first_page)); + + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); inuse = first_page->inuse; max_objects = first_page->objects; @@ -659,7 +660,7 @@ static void insert_zspage(struct page *first_page, struct size_class *class, { struct page **head; - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); if (fullness >= _ZS_NR_FULLNESS_GROUPS) return; @@ -691,13 +692,13 @@ static void remove_zspage(struct page *first_page, struct size_class *class, { struct page **head; - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); if (fullness >= _ZS_NR_FULLNESS_GROUPS) return; head = &class->fullness_list[fullness]; - BUG_ON(!*head); + VM_BUG_ON_PAGE(!*head, first_page); if (list_empty(&(*head)->lru)) *head = NULL; else if (*head == first_page) @@ -724,8 +725,6 @@ static enum fullness_group fix_fullness_group(struct size_class *class, int class_idx; enum fullness_group currfg, newfg; - BUG_ON(!is_first_page(first_page)); - get_zspage_mapping(first_page, &class_idx, &currfg); newfg = get_fullness_group(first_page); if (newfg == currfg) @@ -811,7 +810,7 @@ static void *location_to_obj(struct page *page, unsigned long obj_idx) unsigned long obj; if (!page) { - BUG_ON(obj_idx); + VM_BUG_ON(obj_idx); return NULL; } @@ -844,7 +843,7 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page, void *obj) { if (class->huge) { - VM_BUG_ON(!is_first_page(page)); + VM_BUG_ON_PAGE(!is_first_page(page), page); return page_private(page); } else return *(unsigned long *)obj; @@ -894,8 +893,8 @@ static void free_zspage(struct page *first_page) { struct page *nextp, *tmp, *head_extra; - BUG_ON(!is_first_page(first_page)); - BUG_ON(first_page->inuse); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); + VM_BUG_ON_PAGE(first_page->inuse, first_page); head_extra = (struct page *)page_private(first_page); @@ -921,7 +920,8 @@ static void init_zspage(struct page *first_page, struct size_class *class) unsigned long off = 0; struct page *page = first_page; - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); + while (page) { struct page *next_page; struct link_free *link; @@ -1238,7 +1238,7 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) static bool zspage_full(struct page *first_page) { - BUG_ON(!is_first_page(first_page)); + VM_BUG_ON_PAGE(!is_first_page(first_page), first_page); return 
first_page->inuse == first_page->objects; } @@ -1276,14 +1276,12 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, struct page *pages[2]; void *ret; - BUG_ON(!handle); - /* * Because we use per-cpu mapping areas shared among the * pools/users, we can't allow mapping in interrupt context * because it can corrupt another users mappings. */ - BUG_ON(in_interrupt()); + WARN_ON_ONCE(in_interrupt()); /* From now on, migration cannot move the object */ pin_tag(handle); @@ -1327,8 +1325,6 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) struct size_class *class; struct mapping_area *area; - BUG_ON(!handle); - obj = handle_to_obj(handle); obj_to_location(obj, &page, &obj_idx); get_zspage_mapping(get_first_page(page), &class_idx, &fg); @@ -1448,8 +1444,6 @@ static void obj_free(struct zs_pool *pool, struct size_class *class, unsigned long f_objidx, f_offset; void *vaddr; - BUG_ON(!obj); - obj &= ~OBJ_ALLOCATED_TAG; obj_to_location(obj, &f_page, &f_objidx); first_page = get_first_page(f_page); @@ -1549,7 +1543,6 @@ static void zs_object_copy(unsigned long dst, unsigned long src, kunmap_atomic(d_addr); kunmap_atomic(s_addr); s_page = get_next_page(s_page); - BUG_ON(!s_page); s_addr = kmap_atomic(s_page); d_addr = kmap_atomic(d_page); s_size = class->size - written; @@ -1559,7 +1552,6 @@ static void zs_object_copy(unsigned long dst, unsigned long src, if (d_off >= PAGE_SIZE) { kunmap_atomic(d_addr); d_page = get_next_page(d_page); - BUG_ON(!d_page); d_addr = kmap_atomic(d_page); d_size = class->size - written; d_off = 0; @@ -1694,8 +1686,6 @@ static enum fullness_group putback_zspage(struct zs_pool *pool, { enum fullness_group fullness; - BUG_ON(!is_first_page(first_page)); - fullness = get_fullness_group(first_page); insert_zspage(first_page, class, fullness); set_zspage_mapping(first_page, class->index, fullness); @@ -1759,8 +1749,6 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class) spin_lock(&class->lock); while ((src_page = isolate_source_page(class))) { - BUG_ON(!is_first_page(src_page)); - if (!zs_can_compact(class)) break; -- cgit v0.10.2 From 251cbb951b831acd8451d75b40696834f07c29c5 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 20 May 2016 16:59:42 -0700 Subject: zsmalloc: reorder function parameters Clean up function parameter ordering to order higher data structure first. Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index dfe684c..18535ab 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -569,7 +569,7 @@ static const struct file_operations zs_stat_size_ops = { .release = single_release, }; -static int zs_pool_stat_create(const char *name, struct zs_pool *pool) +static int zs_pool_stat_create(struct zs_pool *pool, const char *name) { struct dentry *entry; @@ -609,7 +609,7 @@ static void __exit zs_stat_exit(void) { } -static inline int zs_pool_stat_create(const char *name, struct zs_pool *pool) +static inline int zs_pool_stat_create(struct zs_pool *pool, const char *name) { return 0; } @@ -655,8 +655,9 @@ static enum fullness_group get_fullness_group(struct page *first_page) * have. This functions inserts the given zspage into the freelist * identified by . 
*/ -static void insert_zspage(struct page *first_page, struct size_class *class, - enum fullness_group fullness) +static void insert_zspage(struct size_class *class, + enum fullness_group fullness, + struct page *first_page) { struct page **head; @@ -687,8 +688,9 @@ static void insert_zspage(struct page *first_page, struct size_class *class, * This function removes the given zspage from the freelist identified * by . */ -static void remove_zspage(struct page *first_page, struct size_class *class, - enum fullness_group fullness) +static void remove_zspage(struct size_class *class, + enum fullness_group fullness, + struct page *first_page) { struct page **head; @@ -730,8 +732,8 @@ static enum fullness_group fix_fullness_group(struct size_class *class, if (newfg == currfg) goto out; - remove_zspage(first_page, class, currfg); - insert_zspage(first_page, class, newfg); + remove_zspage(class, currfg, first_page); + insert_zspage(class, newfg, first_page); set_zspage_mapping(first_page, class_idx, newfg); out: @@ -915,7 +917,7 @@ static void free_zspage(struct page *first_page) } /* Initialize a newly allocated zspage */ -static void init_zspage(struct page *first_page, struct size_class *class) +static void init_zspage(struct size_class *class, struct page *first_page) { unsigned long off = 0; struct page *page = first_page; @@ -1003,7 +1005,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags) prev_page = page; } - init_zspage(first_page, class); + init_zspage(class, first_page); first_page->freelist = location_to_obj(first_page, 0); /* Maximum number of objects we can store in this zspage */ @@ -1348,8 +1350,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle) } EXPORT_SYMBOL_GPL(zs_unmap_object); -static unsigned long obj_malloc(struct page *first_page, - struct size_class *class, unsigned long handle) +static unsigned long obj_malloc(struct size_class *class, + struct page *first_page, unsigned long handle) { unsigned long obj; struct link_free *link; @@ -1426,7 +1428,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) class->size, class->pages_per_zspage)); } - obj = obj_malloc(first_page, class, handle); + obj = obj_malloc(class, first_page, handle); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(class, first_page); record_obj(handle, obj); @@ -1499,8 +1501,8 @@ void zs_free(struct zs_pool *pool, unsigned long handle) } EXPORT_SYMBOL_GPL(zs_free); -static void zs_object_copy(unsigned long dst, unsigned long src, - struct size_class *class) +static void zs_object_copy(struct size_class *class, unsigned long dst, + unsigned long src) { struct page *s_page, *d_page; unsigned long s_objidx, d_objidx; @@ -1566,8 +1568,8 @@ static void zs_object_copy(unsigned long dst, unsigned long src, * Find alloced object in zspage from index object and * return handle. 
*/ -static unsigned long find_alloced_obj(struct page *page, int index, - struct size_class *class) +static unsigned long find_alloced_obj(struct size_class *class, + struct page *page, int index) { unsigned long head; int offset = 0; @@ -1617,7 +1619,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class, int ret = 0; while (1) { - handle = find_alloced_obj(s_page, index, class); + handle = find_alloced_obj(class, s_page, index); if (!handle) { s_page = get_next_page(s_page); if (!s_page) @@ -1634,8 +1636,8 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class, } used_obj = handle_to_obj(handle); - free_obj = obj_malloc(d_page, class, handle); - zs_object_copy(free_obj, used_obj, class); + free_obj = obj_malloc(class, d_page, handle); + zs_object_copy(class, free_obj, used_obj); index++; /* * record_obj updates handle's value to free_obj and it will @@ -1664,7 +1666,7 @@ static struct page *isolate_target_page(struct size_class *class) for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) { page = class->fullness_list[i]; if (page) { - remove_zspage(page, class, i); + remove_zspage(class, i, page); break; } } @@ -1687,7 +1689,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool, enum fullness_group fullness; fullness = get_fullness_group(first_page); - insert_zspage(first_page, class, fullness); + insert_zspage(class, fullness, first_page); set_zspage_mapping(first_page, class->index, fullness); if (fullness == ZS_EMPTY) { @@ -1712,7 +1714,7 @@ static struct page *isolate_source_page(struct size_class *class) if (!page) continue; - remove_zspage(page, class, i); + remove_zspage(class, i, page); break; } @@ -1949,7 +1951,7 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags) pool->flags = flags; - if (zs_pool_stat_create(name, pool)) + if (zs_pool_stat_create(pool, name)) goto err; /* -- cgit v0.10.2 From 1ee4716585ed80b7917ba3c5aa38e5e0d677d583 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Fri, 20 May 2016 16:59:45 -0700 Subject: zsmalloc: remove unused pool param in obj_free Let's remove unused pool param in obj_free Signed-off-by: Minchan Kim Reviewed-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 18535ab..ae288c9 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1438,8 +1438,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) } EXPORT_SYMBOL_GPL(zs_malloc); -static void obj_free(struct zs_pool *pool, struct size_class *class, - unsigned long obj) +static void obj_free(struct size_class *class, unsigned long obj) { struct link_free *link; struct page *first_page, *f_page; @@ -1485,7 +1484,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle) class = pool->size_class[class_idx]; spin_lock(&class->lock); - obj_free(pool, class, obj); + obj_free(class, obj); fullness = fix_fullness_group(class, first_page); if (fullness == ZS_EMPTY) { zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage( @@ -1648,7 +1647,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class, free_obj |= BIT(HANDLE_PIN_BIT); record_obj(handle, free_obj); unpin_tag(handle); - obj_free(pool, class, used_obj); + obj_free(class, used_obj); } /* Remember last position in this iteration */ -- cgit v0.10.2 From d0d8da2dc49dfdfe1d788eaf4d55eb5d4964d926 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Fri, 20 May 2016 16:59:48 -0700 Subject: zsmalloc: require GFP in zs_malloc() Pass GFP flags to zs_malloc() instead of using a fixed 
mask supplied to zs_create_pool(), so we can be more flexible, but, more importantly, we need this to switch zram to per-cpu compression streams -- zram will try to allocate handle with preemption disabled in a fast path and switch to a slow path (using different gfp mask) if the fast one has failed. Apart from that, this also align zs_malloc() interface with zspool/zbud. [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask] Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 370c2f7..b09acdb 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -514,7 +514,7 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize) goto out_error; } - meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO | __GFP_HIGHMEM); + meta->mem_pool = zs_create_pool(pool_name); if (!meta->mem_pool) { pr_err("Error creating memory pool\n"); goto out_error; @@ -717,7 +717,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, src = uncmem; } - handle = zs_malloc(meta->mem_pool, clen); + handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM); if (!handle) { pr_err("Error allocating memory for compressed page: %u, size=%zu\n", index, clen); diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index 34eb160..57a8e98 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -41,10 +41,10 @@ struct zs_pool_stats { struct zs_pool; -struct zs_pool *zs_create_pool(const char *name, gfp_t flags); +struct zs_pool *zs_create_pool(const char *name); void zs_destroy_pool(struct zs_pool *pool); -unsigned long zs_malloc(struct zs_pool *pool, size_t size); +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags); void zs_free(struct zs_pool *pool, unsigned long obj); void *zs_map_object(struct zs_pool *pool, unsigned long handle, diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index ae288c9..aba39a2 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -247,7 +247,6 @@ struct zs_pool { struct size_class **size_class; struct kmem_cache *handle_cachep; - gfp_t flags; /* allocation flags used when growing pool */ atomic_long_t pages_allocated; struct zs_pool_stats stats; @@ -295,10 +294,10 @@ static void destroy_handle_cache(struct zs_pool *pool) kmem_cache_destroy(pool->handle_cachep); } -static unsigned long alloc_handle(struct zs_pool *pool) +static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp) { return (unsigned long)kmem_cache_alloc(pool->handle_cachep, - pool->flags & ~__GFP_HIGHMEM); + gfp & ~__GFP_HIGHMEM); } static void free_handle(struct zs_pool *pool, unsigned long handle) @@ -324,7 +323,12 @@ static void *zs_zpool_create(const char *name, gfp_t gfp, const struct zpool_ops *zpool_ops, struct zpool *zpool) { - return zs_create_pool(name, gfp); + /* + * Ignore global gfp flags: zs_malloc() may be invoked from + * different contexts and its caller must provide a valid + * gfp mask. + */ + return zs_create_pool(name); } static void zs_zpool_destroy(void *pool) @@ -335,7 +339,7 @@ static void zs_zpool_destroy(void *pool) static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp, unsigned long *handle) { - *handle = zs_malloc(pool, size); + *handle = zs_malloc(pool, size, gfp); return *handle ? 
0 : -1; } static void zs_zpool_free(void *pool, unsigned long handle) @@ -1391,7 +1395,7 @@ static unsigned long obj_malloc(struct size_class *class, * otherwise 0. * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail. */ -unsigned long zs_malloc(struct zs_pool *pool, size_t size) +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp) { unsigned long handle, obj; struct size_class *class; @@ -1400,7 +1404,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) return 0; - handle = alloc_handle(pool); + handle = alloc_handle(pool, gfp); if (!handle) return 0; @@ -1413,7 +1417,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size) if (!first_page) { spin_unlock(&class->lock); - first_page = alloc_zspage(class, pool->flags); + first_page = alloc_zspage(class, gfp); if (unlikely(!first_page)) { free_handle(pool, handle); return 0; @@ -1878,7 +1882,7 @@ static int zs_register_shrinker(struct zs_pool *pool) * On success, a pointer to the newly created pool is returned, * otherwise NULL. */ -struct zs_pool *zs_create_pool(const char *name, gfp_t flags) +struct zs_pool *zs_create_pool(const char *name) { int i; struct zs_pool *pool; @@ -1948,8 +1952,6 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags) prev_class = class; } - pool->flags = flags; - if (zs_pool_stat_create(pool, name)) goto err; -- cgit v0.10.2 From da9556a2367cf2261ab4d3e100693c82fb1ddb26 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Fri, 20 May 2016 16:59:51 -0700 Subject: zram: user per-cpu compression streams Remove idle streams list and keep compression streams in per-cpu data. This removes two contented spin_lock()/spin_unlock() calls from write path and also prevent write OP from being preempted while holding the compression stream, which can cause slow downs. For instance, let's assume that we have N cpus and N-2 max_comp_streams.TASK1 owns the last idle stream, TASK2-TASK3 come in with the write requests: TASK1 TASK2 TASK3 zram_bvec_write() spin_lock find stream spin_unlock compress <> zram_bvec_write() spin_lock find stream spin_unlock no_stream schedule zram_bvec_write() spin_lock find_stream spin_unlock no_stream schedule spin_lock release stream spin_unlock wake up TASK2 not only TASK2 and TASK3 will not get the stream, TASK1 will be preempted in the middle of its operation; while we would prefer it to finish compression and release the stream. Test environment: x86_64, 4 CPU box, 3G zram, lzo The following fio tests were executed: read, randread, write, randwrite, rw, randrw with the increasing number of jobs from 1 to 10. 
4 streams 8 streams per-cpu =========================================================== jobs1 READ: 2520.1MB/s 2566.5MB/s 2491.5MB/s READ: 2102.7MB/s 2104.2MB/s 2091.3MB/s WRITE: 1355.1MB/s 1320.2MB/s 1378.9MB/s WRITE: 1103.5MB/s 1097.2MB/s 1122.5MB/s READ: 434013KB/s 435153KB/s 439961KB/s WRITE: 433969KB/s 435109KB/s 439917KB/s READ: 403166KB/s 405139KB/s 403373KB/s WRITE: 403223KB/s 405197KB/s 403430KB/s jobs2 READ: 7958.6MB/s 8105.6MB/s 8073.7MB/s READ: 6864.9MB/s 6989.8MB/s 7021.8MB/s WRITE: 2438.1MB/s 2346.9MB/s 3400.2MB/s WRITE: 1994.2MB/s 1990.3MB/s 2941.2MB/s READ: 981504KB/s 973906KB/s 1018.8MB/s WRITE: 981659KB/s 974060KB/s 1018.1MB/s READ: 937021KB/s 938976KB/s 987250KB/s WRITE: 934878KB/s 936830KB/s 984993KB/s jobs3 READ: 13280MB/s 13553MB/s 13553MB/s READ: 11534MB/s 11785MB/s 11755MB/s WRITE: 3456.9MB/s 3469.9MB/s 4810.3MB/s WRITE: 3029.6MB/s 3031.6MB/s 4264.8MB/s READ: 1363.8MB/s 1362.6MB/s 1448.9MB/s WRITE: 1361.9MB/s 1360.7MB/s 1446.9MB/s READ: 1309.4MB/s 1310.6MB/s 1397.5MB/s WRITE: 1307.4MB/s 1308.5MB/s 1395.3MB/s jobs4 READ: 20244MB/s 20177MB/s 20344MB/s READ: 17886MB/s 17913MB/s 17835MB/s WRITE: 4071.6MB/s 4046.1MB/s 6370.2MB/s WRITE: 3608.9MB/s 3576.3MB/s 5785.4MB/s READ: 1824.3MB/s 1821.6MB/s 1997.5MB/s WRITE: 1819.8MB/s 1817.4MB/s 1992.5MB/s READ: 1765.7MB/s 1768.3MB/s 1937.3MB/s WRITE: 1767.5MB/s 1769.1MB/s 1939.2MB/s jobs5 READ: 18663MB/s 18986MB/s 18823MB/s READ: 16659MB/s 16605MB/s 16954MB/s WRITE: 3912.4MB/s 3888.7MB/s 6126.9MB/s WRITE: 3506.4MB/s 3442.5MB/s 5519.3MB/s READ: 1798.2MB/s 1746.5MB/s 1935.8MB/s WRITE: 1792.7MB/s 1740.7MB/s 1929.1MB/s READ: 1727.6MB/s 1658.2MB/s 1917.3MB/s WRITE: 1726.5MB/s 1657.2MB/s 1916.6MB/s jobs6 READ: 21017MB/s 20922MB/s 21162MB/s READ: 19022MB/s 19140MB/s 18770MB/s WRITE: 3968.2MB/s 4037.7MB/s 6620.8MB/s WRITE: 3643.5MB/s 3590.2MB/s 6027.5MB/s READ: 1871.8MB/s 1880.5MB/s 2049.9MB/s WRITE: 1867.8MB/s 1877.2MB/s 2046.2MB/s READ: 1755.8MB/s 1710.3MB/s 1964.7MB/s WRITE: 1750.5MB/s 1705.9MB/s 1958.8MB/s jobs7 READ: 21103MB/s 20677MB/s 21482MB/s READ: 18522MB/s 18379MB/s 19443MB/s WRITE: 4022.5MB/s 4067.4MB/s 6755.9MB/s WRITE: 3691.7MB/s 3695.5MB/s 5925.6MB/s READ: 1841.5MB/s 1933.9MB/s 2090.5MB/s WRITE: 1842.7MB/s 1935.3MB/s 2091.9MB/s READ: 1832.4MB/s 1856.4MB/s 1971.5MB/s WRITE: 1822.3MB/s 1846.2MB/s 1960.6MB/s jobs8 READ: 20463MB/s 20194MB/s 20862MB/s READ: 18178MB/s 17978MB/s 18299MB/s WRITE: 4085.9MB/s 4060.2MB/s 7023.8MB/s WRITE: 3776.3MB/s 3737.9MB/s 6278.2MB/s READ: 1957.6MB/s 1944.4MB/s 2109.5MB/s WRITE: 1959.2MB/s 1946.2MB/s 2111.4MB/s READ: 1900.6MB/s 1885.7MB/s 2082.1MB/s WRITE: 1896.2MB/s 1881.4MB/s 2078.3MB/s jobs9 READ: 19692MB/s 19734MB/s 19334MB/s READ: 17678MB/s 18249MB/s 17666MB/s WRITE: 4004.7MB/s 4064.8MB/s 6990.7MB/s WRITE: 3724.7MB/s 3772.1MB/s 6193.6MB/s READ: 1953.7MB/s 1967.3MB/s 2105.6MB/s WRITE: 1953.4MB/s 1966.7MB/s 2104.1MB/s READ: 1860.4MB/s 1897.4MB/s 2068.5MB/s WRITE: 1858.9MB/s 1895.9MB/s 2066.8MB/s jobs10 READ: 19730MB/s 19579MB/s 19492MB/s READ: 18028MB/s 18018MB/s 18221MB/s WRITE: 4027.3MB/s 4090.6MB/s 7020.1MB/s WRITE: 3810.5MB/s 3846.8MB/s 6426.8MB/s READ: 1956.1MB/s 1994.6MB/s 2145.2MB/s WRITE: 1955.9MB/s 1993.5MB/s 2144.8MB/s READ: 1852.8MB/s 1911.6MB/s 2075.8MB/s WRITE: 1855.7MB/s 1914.6MB/s 2078.1MB/s perf stat 4 streams 8 streams per-cpu ==================================================================================================================== jobs1 stalled-cycles-frontend 23,174,811,209 ( 38.21%) 23,220,254,188 ( 38.25%) 23,061,406,918 ( 38.34%) stalled-cycles-backend 
11,514,174,638 ( 18.98%) 11,696,722,657 ( 19.27%) 11,370,852,810 ( 18.90%) instructions 73,925,005,782 ( 1.22) 73,903,177,632 ( 1.22) 73,507,201,037 ( 1.22) branches 14,455,124,835 ( 756.063) 14,455,184,779 ( 755.281) 14,378,599,509 ( 758.546) branch-misses 69,801,336 ( 0.48%) 80,225,529 ( 0.55%) 72,044,726 ( 0.50%) jobs2 stalled-cycles-frontend 49,912,741,782 ( 46.11%) 50,101,189,290 ( 45.95%) 32,874,195,633 ( 35.11%) stalled-cycles-backend 27,080,366,230 ( 25.02%) 27,949,970,232 ( 25.63%) 16,461,222,706 ( 17.58%) instructions 122,831,629,690 ( 1.13) 122,919,846,419 ( 1.13) 121,924,786,775 ( 1.30) branches 23,725,889,239 ( 692.663) 23,733,547,140 ( 688.062) 23,553,950,311 ( 794.794) branch-misses 90,733,041 ( 0.38%) 96,320,895 ( 0.41%) 84,561,092 ( 0.36%) jobs3 stalled-cycles-frontend 66,437,834,608 ( 45.58%) 63,534,923,344 ( 43.69%) 42,101,478,505 ( 33.19%) stalled-cycles-backend 34,940,799,661 ( 23.97%) 34,774,043,148 ( 23.91%) 21,163,324,388 ( 16.68%) instructions 171,692,121,862 ( 1.18) 171,775,373,044 ( 1.18) 170,353,542,261 ( 1.34) branches 32,968,962,622 ( 628.723) 32,987,739,894 ( 630.512) 32,729,463,918 ( 717.027) branch-misses 111,522,732 ( 0.34%) 110,472,894 ( 0.33%) 99,791,291 ( 0.30%) jobs4 stalled-cycles-frontend 98,741,701,675 ( 49.72%) 94,797,349,965 ( 47.59%) 54,535,655,381 ( 33.53%) stalled-cycles-backend 54,642,609,615 ( 27.51%) 55,233,554,408 ( 27.73%) 27,882,323,541 ( 17.14%) instructions 220,884,807,851 ( 1.11) 220,930,887,273 ( 1.11) 218,926,845,851 ( 1.35) branches 42,354,518,180 ( 592.105) 42,362,770,587 ( 590.452) 41,955,552,870 ( 716.154) branch-misses 138,093,449 ( 0.33%) 131,295,286 ( 0.31%) 121,794,771 ( 0.29%) jobs5 stalled-cycles-frontend 116,219,747,212 ( 48.14%) 110,310,397,012 ( 46.29%) 66,373,082,723 ( 33.70%) stalled-cycles-backend 66,325,434,776 ( 27.48%) 64,157,087,914 ( 26.92%) 32,999,097,299 ( 16.76%) instructions 270,615,008,466 ( 1.12) 270,546,409,525 ( 1.14) 268,439,910,948 ( 1.36) branches 51,834,046,557 ( 599.108) 51,811,867,722 ( 608.883) 51,412,576,077 ( 729.213) branch-misses 158,197,086 ( 0.31%) 142,639,805 ( 0.28%) 133,425,455 ( 0.26%) jobs6 stalled-cycles-frontend 138,009,414,492 ( 48.23%) 139,063,571,254 ( 48.80%) 75,278,568,278 ( 32.80%) stalled-cycles-backend 79,211,949,650 ( 27.68%) 79,077,241,028 ( 27.75%) 37,735,797,899 ( 16.44%) instructions 319,763,993,731 ( 1.12) 319,937,782,834 ( 1.12) 316,663,600,784 ( 1.38) branches 61,219,433,294 ( 595.056) 61,250,355,540 ( 598.215) 60,523,446,617 ( 733.706) branch-misses 169,257,123 ( 0.28%) 154,898,028 ( 0.25%) 141,180,587 ( 0.23%) jobs7 stalled-cycles-frontend 162,974,812,119 ( 49.20%) 159,290,061,987 ( 48.43%) 88,046,641,169 ( 33.21%) stalled-cycles-backend 92,223,151,661 ( 27.84%) 91,667,904,406 ( 27.87%) 44,068,454,971 ( 16.62%) instructions 369,516,432,430 ( 1.12) 369,361,799,063 ( 1.12) 365,290,380,661 ( 1.38) branches 70,795,673,950 ( 594.220) 70,743,136,124 ( 597.876) 69,803,996,038 ( 732.822) branch-misses 181,708,327 ( 0.26%) 165,767,821 ( 0.23%) 150,109,797 ( 0.22%) jobs8 stalled-cycles-frontend 185,000,017,027 ( 49.30%) 182,334,345,473 ( 48.37%) 99,980,147,041 ( 33.26%) stalled-cycles-backend 105,753,516,186 ( 28.18%) 107,937,830,322 ( 28.63%) 51,404,177,181 ( 17.10%) instructions 418,153,161,055 ( 1.11) 418,308,565,828 ( 1.11) 413,653,475,581 ( 1.38) branches 80,035,882,398 ( 592.296) 80,063,204,510 ( 589.843) 79,024,105,589 ( 730.530) branch-misses 199,764,528 ( 0.25%) 177,936,926 ( 0.22%) 160,525,449 ( 0.20%) jobs9 stalled-cycles-frontend 210,941,799,094 ( 49.63%) 
204,714,679,254 ( 48.55%) 114,251,113,756 ( 33.96%) stalled-cycles-backend 122,640,849,067 ( 28.85%) 122,188,553,256 ( 28.98%) 58,360,041,127 ( 17.35%) instructions 468,151,025,415 ( 1.10) 467,354,869,323 ( 1.11) 462,665,165,216 ( 1.38) branches 89,657,067,510 ( 585.628) 89,411,550,407 ( 588.990) 88,360,523,943 ( 730.151) branch-misses 218,292,301 ( 0.24%) 191,701,247 ( 0.21%) 178,535,678 ( 0.20%) jobs10 stalled-cycles-frontend 233,595,958,008 ( 49.81%) 227,540,615,689 ( 49.11%) 160,341,979,938 ( 43.07%) stalled-cycles-backend 136,153,676,021 ( 29.03%) 133,635,240,742 ( 28.84%) 65,909,135,465 ( 17.70%) instructions 517,001,168,497 ( 1.10) 516,210,976,158 ( 1.11) 511,374,038,613 ( 1.37) branches 98,911,641,329 ( 585.796) 98,700,069,712 ( 591.583) 97,646,761,028 ( 728.712) branch-misses 232,341,823 ( 0.23%) 199,256,308 ( 0.20%) 183,135,268 ( 0.19%) per-cpu streams tend to cause significantly less stalled cycles; execute less branches and hit less branch-misses. perf stat reported execution time 4 streams 8 streams per-cpu ==================================================================== jobs1 seconds elapsed 20.909073870 20.875670495 20.817838540 jobs2 seconds elapsed 18.529488399 18.720566469 16.356103108 jobs3 seconds elapsed 18.991159531 18.991340812 16.766216066 jobs4 seconds elapsed 19.560643828 19.551323547 16.246621715 jobs5 seconds elapsed 24.746498464 25.221646740 20.696112444 jobs6 seconds elapsed 28.258181828 28.289765505 22.885688857 jobs7 seconds elapsed 32.632490241 31.909125381 26.272753738 jobs8 seconds elapsed 35.651403851 36.027596308 29.108024711 jobs9 seconds elapsed 40.569362365 40.024227989 32.898204012 jobs10 seconds elapsed 44.673112304 43.874898137 35.632952191 Please see Link: http://marc.info/?l=linux-kernel&m=146166970727530 Link: http://marc.info/?l=linux-kernel&m=146174716719650 for more test results (under low memory conditions). 
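The core of the change, as a rough sketch simplified from the diff below (helper names here are made up): instead of taking a compression stream off a shared, spinlock-protected idle list, the write path uses the stream that belongs to the local CPU. get_cpu_ptr() disables preemption, so the task cannot be scheduled out while it holds the stream:

static struct zcomp_strm *example_strm_get(struct zcomp *comp)
{
	/* per-CPU stream, no locking; preemption stays disabled until put */
	return *get_cpu_ptr(comp->stream);
}

static void example_strm_put(struct zcomp *comp)
{
	put_cpu_ptr(comp->stream);	/* re-enables preemption */
}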
Signed-off-by: Sergey Senozhatsky Suggested-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 3ef42e5..bc98d5e 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "zcomp.h" #include "zcomp_lzo.h" @@ -20,29 +21,6 @@ #include "zcomp_lz4.h" #endif -/* - * single zcomp_strm backend - */ -struct zcomp_strm_single { - struct mutex strm_lock; - struct zcomp_strm *zstrm; -}; - -/* - * multi zcomp_strm backend - */ -struct zcomp_strm_multi { - /* protect strm list */ - spinlock_t strm_lock; - /* max possible number of zstrm streams */ - int max_strm; - /* number of available zstrm streams */ - int avail_strm; - /* list of available strms */ - struct list_head idle_strm; - wait_queue_head_t strm_wait; -}; - static struct zcomp_backend *backends[] = { &zcomp_lzo, #ifdef CONFIG_ZRAM_LZ4_COMPRESS @@ -93,188 +71,6 @@ static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp, gfp_t flags) return zstrm; } -/* - * get idle zcomp_strm or wait until other process release - * (zcomp_strm_release()) one for us - */ -static struct zcomp_strm *zcomp_strm_multi_find(struct zcomp *comp) -{ - struct zcomp_strm_multi *zs = comp->stream; - struct zcomp_strm *zstrm; - - while (1) { - spin_lock(&zs->strm_lock); - if (!list_empty(&zs->idle_strm)) { - zstrm = list_entry(zs->idle_strm.next, - struct zcomp_strm, list); - list_del(&zstrm->list); - spin_unlock(&zs->strm_lock); - return zstrm; - } - /* zstrm streams limit reached, wait for idle stream */ - if (zs->avail_strm >= zs->max_strm) { - spin_unlock(&zs->strm_lock); - wait_event(zs->strm_wait, !list_empty(&zs->idle_strm)); - continue; - } - /* allocate new zstrm stream */ - zs->avail_strm++; - spin_unlock(&zs->strm_lock); - /* - * This function can be called in swapout/fs write path - * so we can't use GFP_FS|IO. And it assumes we already - * have at least one stream in zram initialization so we - * don't do best effort to allocate more stream in here. - * A default stream will work well without further multiple - * streams. That's why we use NORETRY | NOWARN. - */ - zstrm = zcomp_strm_alloc(comp, GFP_NOIO | __GFP_NORETRY | - __GFP_NOWARN); - if (!zstrm) { - spin_lock(&zs->strm_lock); - zs->avail_strm--; - spin_unlock(&zs->strm_lock); - wait_event(zs->strm_wait, !list_empty(&zs->idle_strm)); - continue; - } - break; - } - return zstrm; -} - -/* add stream back to idle list and wake up waiter or free the stream */ -static void zcomp_strm_multi_release(struct zcomp *comp, struct zcomp_strm *zstrm) -{ - struct zcomp_strm_multi *zs = comp->stream; - - spin_lock(&zs->strm_lock); - if (zs->avail_strm <= zs->max_strm) { - list_add(&zstrm->list, &zs->idle_strm); - spin_unlock(&zs->strm_lock); - wake_up(&zs->strm_wait); - return; - } - - zs->avail_strm--; - spin_unlock(&zs->strm_lock); - zcomp_strm_free(comp, zstrm); -} - -/* change max_strm limit */ -static bool zcomp_strm_multi_set_max_streams(struct zcomp *comp, int num_strm) -{ - struct zcomp_strm_multi *zs = comp->stream; - struct zcomp_strm *zstrm; - - spin_lock(&zs->strm_lock); - zs->max_strm = num_strm; - /* - * if user has lowered the limit and there are idle streams, - * immediately free as much streams (and memory) as we can. 
- */ - while (zs->avail_strm > num_strm && !list_empty(&zs->idle_strm)) { - zstrm = list_entry(zs->idle_strm.next, - struct zcomp_strm, list); - list_del(&zstrm->list); - zcomp_strm_free(comp, zstrm); - zs->avail_strm--; - } - spin_unlock(&zs->strm_lock); - return true; -} - -static void zcomp_strm_multi_destroy(struct zcomp *comp) -{ - struct zcomp_strm_multi *zs = comp->stream; - struct zcomp_strm *zstrm; - - while (!list_empty(&zs->idle_strm)) { - zstrm = list_entry(zs->idle_strm.next, - struct zcomp_strm, list); - list_del(&zstrm->list); - zcomp_strm_free(comp, zstrm); - } - kfree(zs); -} - -static int zcomp_strm_multi_create(struct zcomp *comp, int max_strm) -{ - struct zcomp_strm *zstrm; - struct zcomp_strm_multi *zs; - - comp->destroy = zcomp_strm_multi_destroy; - comp->strm_find = zcomp_strm_multi_find; - comp->strm_release = zcomp_strm_multi_release; - comp->set_max_streams = zcomp_strm_multi_set_max_streams; - zs = kmalloc(sizeof(struct zcomp_strm_multi), GFP_KERNEL); - if (!zs) - return -ENOMEM; - - comp->stream = zs; - spin_lock_init(&zs->strm_lock); - INIT_LIST_HEAD(&zs->idle_strm); - init_waitqueue_head(&zs->strm_wait); - zs->max_strm = max_strm; - zs->avail_strm = 1; - - zstrm = zcomp_strm_alloc(comp, GFP_KERNEL); - if (!zstrm) { - kfree(zs); - return -ENOMEM; - } - list_add(&zstrm->list, &zs->idle_strm); - return 0; -} - -static struct zcomp_strm *zcomp_strm_single_find(struct zcomp *comp) -{ - struct zcomp_strm_single *zs = comp->stream; - mutex_lock(&zs->strm_lock); - return zs->zstrm; -} - -static void zcomp_strm_single_release(struct zcomp *comp, - struct zcomp_strm *zstrm) -{ - struct zcomp_strm_single *zs = comp->stream; - mutex_unlock(&zs->strm_lock); -} - -static bool zcomp_strm_single_set_max_streams(struct zcomp *comp, int num_strm) -{ - /* zcomp_strm_single support only max_comp_streams == 1 */ - return false; -} - -static void zcomp_strm_single_destroy(struct zcomp *comp) -{ - struct zcomp_strm_single *zs = comp->stream; - zcomp_strm_free(comp, zs->zstrm); - kfree(zs); -} - -static int zcomp_strm_single_create(struct zcomp *comp) -{ - struct zcomp_strm_single *zs; - - comp->destroy = zcomp_strm_single_destroy; - comp->strm_find = zcomp_strm_single_find; - comp->strm_release = zcomp_strm_single_release; - comp->set_max_streams = zcomp_strm_single_set_max_streams; - zs = kmalloc(sizeof(struct zcomp_strm_single), GFP_KERNEL); - if (!zs) - return -ENOMEM; - - comp->stream = zs; - mutex_init(&zs->strm_lock); - zs->zstrm = zcomp_strm_alloc(comp, GFP_KERNEL); - if (!zs->zstrm) { - kfree(zs); - return -ENOMEM; - } - return 0; -} - /* show available compressors */ ssize_t zcomp_available_show(const char *comp, char *buf) { @@ -301,17 +97,17 @@ bool zcomp_available_algorithm(const char *comp) bool zcomp_set_max_streams(struct zcomp *comp, int num_strm) { - return comp->set_max_streams(comp, num_strm); + return true; } struct zcomp_strm *zcomp_strm_find(struct zcomp *comp) { - return comp->strm_find(comp); + return *get_cpu_ptr(comp->stream); } void zcomp_strm_release(struct zcomp *comp, struct zcomp_strm *zstrm) { - comp->strm_release(comp, zstrm); + put_cpu_ptr(comp->stream); } int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, @@ -327,9 +123,83 @@ int zcomp_decompress(struct zcomp *comp, const unsigned char *src, return comp->backend->decompress(src, src_len, dst); } +static int __zcomp_cpu_notifier(struct zcomp *comp, + unsigned long action, unsigned long cpu) +{ + struct zcomp_strm *zstrm; + + switch (action) { + case CPU_UP_PREPARE: + if 
(WARN_ON(*per_cpu_ptr(comp->stream, cpu))) + break; + zstrm = zcomp_strm_alloc(comp, GFP_KERNEL); + if (IS_ERR_OR_NULL(zstrm)) { + pr_err("Can't allocate a compression stream\n"); + return NOTIFY_BAD; + } + *per_cpu_ptr(comp->stream, cpu) = zstrm; + break; + case CPU_DEAD: + case CPU_UP_CANCELED: + zstrm = *per_cpu_ptr(comp->stream, cpu); + if (!IS_ERR_OR_NULL(zstrm)) + zcomp_strm_free(comp, zstrm); + *per_cpu_ptr(comp->stream, cpu) = NULL; + break; + default: + break; + } + return NOTIFY_OK; +} + +static int zcomp_cpu_notifier(struct notifier_block *nb, + unsigned long action, void *pcpu) +{ + unsigned long cpu = (unsigned long)pcpu; + struct zcomp *comp = container_of(nb, typeof(*comp), notifier); + + return __zcomp_cpu_notifier(comp, action, cpu); +} + +static int zcomp_init(struct zcomp *comp) +{ + unsigned long cpu; + int ret; + + comp->notifier.notifier_call = zcomp_cpu_notifier; + + comp->stream = alloc_percpu(struct zcomp_strm *); + if (!comp->stream) + return -ENOMEM; + + cpu_notifier_register_begin(); + for_each_online_cpu(cpu) { + ret = __zcomp_cpu_notifier(comp, CPU_UP_PREPARE, cpu); + if (ret == NOTIFY_BAD) + goto cleanup; + } + __register_cpu_notifier(&comp->notifier); + cpu_notifier_register_done(); + return 0; + +cleanup: + for_each_online_cpu(cpu) + __zcomp_cpu_notifier(comp, CPU_UP_CANCELED, cpu); + cpu_notifier_register_done(); + return -ENOMEM; +} + void zcomp_destroy(struct zcomp *comp) { - comp->destroy(comp); + unsigned long cpu; + + cpu_notifier_register_begin(); + for_each_online_cpu(cpu) + __zcomp_cpu_notifier(comp, CPU_UP_CANCELED, cpu); + __unregister_cpu_notifier(&comp->notifier); + cpu_notifier_register_done(); + + free_percpu(comp->stream); kfree(comp); } @@ -339,9 +209,9 @@ void zcomp_destroy(struct zcomp *comp) * backend pointer or ERR_PTR if things went bad. ERR_PTR(-EINVAL) * if requested algorithm is not supported, ERR_PTR(-ENOMEM) in * case of allocation error, or any other error potentially - * returned by functions zcomp_strm_{multi,single}_create. + * returned by zcomp_init(). 
*/ -struct zcomp *zcomp_create(const char *compress, int max_strm) +struct zcomp *zcomp_create(const char *compress) { struct zcomp *comp; struct zcomp_backend *backend; @@ -356,10 +226,7 @@ struct zcomp *zcomp_create(const char *compress, int max_strm) return ERR_PTR(-ENOMEM); comp->backend = backend; - if (max_strm > 1) - error = zcomp_strm_multi_create(comp, max_strm); - else - error = zcomp_strm_single_create(comp); + error = zcomp_init(comp); if (error) { kfree(comp); return ERR_PTR(error); diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index b7d2a4b..ffd88cb 100644 --- a/drivers/block/zram/zcomp.h +++ b/drivers/block/zram/zcomp.h @@ -10,8 +10,6 @@ #ifndef _ZCOMP_H_ #define _ZCOMP_H_ -#include - struct zcomp_strm { /* compression/decompression buffer */ void *buffer; @@ -21,8 +19,6 @@ struct zcomp_strm { * working memory) */ void *private; - /* used in multi stream backend, protected by backend strm_lock */ - struct list_head list; }; /* static compression backend */ @@ -41,19 +37,15 @@ struct zcomp_backend { /* dynamic per-device compression frontend */ struct zcomp { - void *stream; + struct zcomp_strm * __percpu *stream; struct zcomp_backend *backend; - - struct zcomp_strm *(*strm_find)(struct zcomp *comp); - void (*strm_release)(struct zcomp *comp, struct zcomp_strm *zstrm); - bool (*set_max_streams)(struct zcomp *comp, int num_strm); - void (*destroy)(struct zcomp *comp); + struct notifier_block notifier; }; ssize_t zcomp_available_show(const char *comp, char *buf); bool zcomp_available_algorithm(const char *comp); -struct zcomp *zcomp_create(const char *comp, int max_strm); +struct zcomp *zcomp_create(const char *comp); void zcomp_destroy(struct zcomp *comp); struct zcomp_strm *zcomp_strm_find(struct zcomp *comp); diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index b09acdb..f92965c 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -650,7 +650,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, { int ret = 0; size_t clen; - unsigned long handle; + unsigned long handle = 0; struct page *page; unsigned char *user_mem, *cmem, *src, *uncmem = NULL; struct zram_meta *meta = zram->meta; @@ -673,9 +673,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, goto out; } - zstrm = zcomp_strm_find(zram->comp); +compress_again: user_mem = kmap_atomic(page); - if (is_partial_io(bvec)) { memcpy(uncmem + offset, user_mem + bvec->bv_offset, bvec->bv_len); @@ -699,6 +698,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, goto out; } + zstrm = zcomp_strm_find(zram->comp); ret = zcomp_compress(zram->comp, zstrm, uncmem, &clen); if (!is_partial_io(bvec)) { kunmap_atomic(user_mem); @@ -710,6 +710,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, pr_err("Compression failed! err=%d\n", ret); goto out; } + src = zstrm->buffer; if (unlikely(clen > max_zpage_size)) { clen = PAGE_SIZE; @@ -717,8 +718,33 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, src = uncmem; } - handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM); + /* + * handle allocation has 2 paths: + * a) fast path is executed with preemption disabled (for + * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear, + * since we can't sleep; + * b) slow path enables preemption and attempts to allocate + * the page with __GFP_DIRECT_RECLAIM bit set. 
we have to + * put per-cpu compression stream and, thus, to re-do + * the compression once handle is allocated. + * + * if we have a 'non-null' handle here then we are coming + * from the slow path and handle has already been allocated. + */ + if (!handle) + handle = zs_malloc(meta->mem_pool, clen, + __GFP_KSWAPD_RECLAIM | + __GFP_NOWARN | + __GFP_HIGHMEM); if (!handle) { + zcomp_strm_release(zram->comp, zstrm); + zstrm = NULL; + + handle = zs_malloc(meta->mem_pool, clen, + GFP_NOIO | __GFP_HIGHMEM); + if (handle) + goto compress_again; + pr_err("Error allocating memory for compressed page: %u, size=%zu\n", index, clen); ret = -ENOMEM; @@ -1038,7 +1064,7 @@ static ssize_t disksize_store(struct device *dev, if (!meta) return -ENOMEM; - comp = zcomp_create(zram->compressor, zram->max_comp_streams); + comp = zcomp_create(zram->compressor); if (IS_ERR(comp)) { pr_err("Cannot initialise %s compressing backend\n", zram->compressor); -- cgit v0.10.2 From 200867af4dedfe7cb707f96773684de1d1fd21e6 Mon Sep 17 00:00:00 2001 From: Dan Streetman Date: Fri, 20 May 2016 16:59:54 -0700 Subject: mm/zswap: use workqueue to destroy pool Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to use the workqueue instead of using call_rcu(). When zswap destroys a pool no longer in use, it uses call_rcu() to perform the destruction/freeing. Since that executes in softirq context, it must not sleep. However, actually destroying the pool involves freeing the per-cpu compressors (which requires locking the cpu_add_remove_lock mutex) and freeing the zpool, for which the implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy, which locks the slab_mutex). So if either mutex is currently taken, or any other part of the compressor or zpool implementation sleeps, it will result in a BUG(). It's not easy to reproduce this when changing zswap's params normally. In testing with a loaded system, this does not fail: $ cd /sys/module/zswap/parameters $ echo lz4 > compressor ; echo zsmalloc > zpool nor does this: $ while true ; do > echo lzo > compressor ; echo zbud > zpool > sleep 1 > echo lz4 > compressor ; echo zsmalloc > zpool > sleep 1 > done although it's still possible either of those might fail, depending on whether anything else besides zswap has locked the mutexes. However, changing a parameter with no delay immediately causes the schedule while atomic BUG: $ while true ; do > echo lzo > compressor ; echo lz4 > compressor > done This is essentially the same as Yu Zhao's proposed patch to zsmalloc, but moved to zswap, to cover compressor and zpool freeing. 
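Condensed from the diff that follows, the conversion looks roughly like this. The final pool-freeing call is written here as zswap_pool_destroy(), the teardown helper zswap already uses; that call sits outside the quoted hunks, so treat it as an assumption of the sketch.

	static void __zswap_pool_release(struct work_struct *work)
	{
		struct zswap_pool *pool = container_of(work, typeof(*pool), work);

		/* process context now: waiting out the RCU grace period may sleep */
		synchronize_rcu();

		/* nobody should have been able to get a kref... */
		WARN_ON(kref_get_unless_zero(&pool->kref));
		zswap_pool_destroy(pool);	/* may sleep: per-cpu tfms, zpool */
	}

	static void __zswap_pool_empty(struct kref *kref)
	{
		struct zswap_pool *pool = container_of(kref, typeof(*pool), kref);

		spin_lock(&zswap_pools_lock);
		WARN_ON(pool == zswap_pool_current());
		list_del_rcu(&pool->list);

		/* was: call_rcu(&pool->rcu_head, __zswap_pool_release) */
		INIT_WORK(&pool->work, __zswap_pool_release);
		schedule_work(&pool->work);

		spin_unlock(&zswap_pools_lock);
	}

Scheduling the work from under zswap_pools_lock is fine because schedule_work() does not sleep; the grace-period wait and the actual freeing move into the worker, which runs in process context.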
Fixes: f1c54846ee45 ("zswap: dynamic pool creation") Signed-off-by: Dan Streetman Reported-by: Yu Zhao Reviewed-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zswap.c b/mm/zswap.c index de0f119b..275b22c 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -117,7 +117,7 @@ struct zswap_pool { struct crypto_comp * __percpu *tfm; struct kref kref; struct list_head list; - struct rcu_head rcu_head; + struct work_struct work; struct notifier_block notifier; char tfm_name[CRYPTO_MAX_ALG_NAME]; }; @@ -658,9 +658,11 @@ static int __must_check zswap_pool_get(struct zswap_pool *pool) return kref_get_unless_zero(&pool->kref); } -static void __zswap_pool_release(struct rcu_head *head) +static void __zswap_pool_release(struct work_struct *work) { - struct zswap_pool *pool = container_of(head, typeof(*pool), rcu_head); + struct zswap_pool *pool = container_of(work, typeof(*pool), work); + + synchronize_rcu(); /* nobody should have been able to get a kref... */ WARN_ON(kref_get_unless_zero(&pool->kref)); @@ -680,7 +682,9 @@ static void __zswap_pool_empty(struct kref *kref) WARN_ON(pool == zswap_pool_current()); list_del_rcu(&pool->list); - call_rcu(&pool->rcu_head, __zswap_pool_release); + + INIT_WORK(&pool->work, __zswap_pool_release); + schedule_work(&pool->work); spin_unlock(&zswap_pools_lock); } -- cgit v0.10.2 From d34f615720d17c49b6779f6fcd5cb7eb82231a38 Mon Sep 17 00:00:00 2001 From: Dan Streetman Date: Fri, 20 May 2016 16:59:56 -0700 Subject: mm/zsmalloc: don't fail if can't create debugfs info Change the return type of zs_pool_stat_create() to void, and remove the logic to abort pool creation if the stat debugfs dir/file could not be created. The debugfs stat file is for debugging/information only, and doesn't affect operation of zsmalloc; there is no reason to abort creating the pool if the stat file can't be created. This was seen with zswap, which used the same name for all pool creations, which caused zsmalloc to fail to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled. 
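The end result, condensed from the diff below (the second debugfs_create_file() call for the "classes" stats is elided here): failures are warned about and otherwise ignored, and the caller no longer aborts pool creation.

	static void zs_pool_stat_create(struct zs_pool *pool, const char *name)
	{
		struct dentry *entry;

		if (!zs_stat_root)
			return;			/* debugfs root missing: not an error */

		entry = debugfs_create_dir(name, zs_stat_root);
		if (!entry) {
			pr_warn("debugfs dir <%s> creation failed\n", name);
			return;			/* stats are informational only */
		}
		pool->stat_dentry = entry;
		/* ... "classes" stat file is created here, also best-effort ... */
	}

	/* in zs_create_pool(): debug only, don't abort if it fails */
	zs_pool_stat_create(pool, name);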
Signed-off-by: Dan Streetman Reviewed-by: Sergey Senozhatsky Cc: Dan Streetman Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index aba39a2..72698db 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -573,17 +573,17 @@ static const struct file_operations zs_stat_size_ops = { .release = single_release, }; -static int zs_pool_stat_create(struct zs_pool *pool, const char *name) +static void zs_pool_stat_create(struct zs_pool *pool, const char *name) { struct dentry *entry; if (!zs_stat_root) - return -ENODEV; + return; entry = debugfs_create_dir(name, zs_stat_root); if (!entry) { pr_warn("debugfs dir <%s> creation failed\n", name); - return -ENOMEM; + return; } pool->stat_dentry = entry; @@ -592,10 +592,8 @@ static int zs_pool_stat_create(struct zs_pool *pool, const char *name) if (!entry) { pr_warn("%s: debugfs file entry <%s> creation failed\n", name, "classes"); - return -ENOMEM; + return; } - - return 0; } static void zs_pool_stat_destroy(struct zs_pool *pool) @@ -613,9 +611,8 @@ static void __exit zs_stat_exit(void) { } -static inline int zs_pool_stat_create(struct zs_pool *pool, const char *name) +static inline void zs_pool_stat_create(struct zs_pool *pool, const char *name) { - return 0; } static inline void zs_pool_stat_destroy(struct zs_pool *pool) @@ -623,7 +620,6 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool) } #endif - /* * For each size class, zspages are divided into different groups * depending on how "full" they are. This was done so that we could @@ -1952,8 +1948,8 @@ struct zs_pool *zs_create_pool(const char *name) prev_class = class; } - if (zs_pool_stat_create(pool, name)) - goto err; + /* debug only, don't abort if it fails */ + zs_pool_stat_create(pool, name); /* * Not critical, we still can use the pool -- cgit v0.10.2 From 43209ea2d17aae1540d4e28274e36404f72702f2 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Fri, 20 May 2016 16:59:59 -0700 Subject: zram: remove max_comp_streams internals Remove the internal part of max_comp_streams interface, since we switched to per-cpu streams. We will keep RW max_comp_streams attr around, because: a) we may (silently) switch back to idle compression streams list and don't want to disturb user space b) max_comp_streams attr must wait for the next 'lay off cycle'; we give user space 2 years to adjust before we remove/downgrade the attr, and there are already several attrs scheduled for removal in 4.11, so it's too late for max_comp_streams. This slightly change a user visible behaviour: - First, reading from max_comp_stream file now will always return the number of online CPUs. - Second, writing to max_comp_stream will not take any effect. Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 5bda503..d88f0c7 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -59,27 +59,16 @@ num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. 2) Set max number of compression streams - Compression backend may use up to max_comp_streams compression streams, - thus allowing up to max_comp_streams concurrent compression operations. - By default, compression backend uses single compression stream. 
- - Examples: - #show max compression streams number + Regardless the value passed to this attribute, ZRAM will always + allocate multiple compression streams - one per online CPUs - thus + allowing several concurrent compression operations. The number of + allocated compression streams goes down when some of the CPUs + become offline. There is no single-compression-stream mode anymore, + unless you are running a UP system or has only 1 CPU online. + + To find out how many streams are currently available: cat /sys/block/zram0/max_comp_streams - #set max compression streams number to 3 - echo 3 > /sys/block/zram0/max_comp_streams - -Note: -In order to enable compression backend's multi stream support max_comp_streams -must be initially set to desired concurrency level before ZRAM device -initialisation. Once the device initialised as a single stream compression -backend (max_comp_streams equals to 1), you will see error if you try to change -the value of max_comp_streams because single stream compression backend -implemented as a special case by lock overhead issue and does not support -dynamic max_comp_streams. Only multi stream backend supports dynamic -max_comp_streams adjustment. - 3) Select compression algorithm Using comp_algorithm device attribute one can see available and currently selected (shown in square brackets) compression algorithms, diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index bc98d5e..b51a816 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -95,11 +95,6 @@ bool zcomp_available_algorithm(const char *comp) return find_backend(comp) != NULL; } -bool zcomp_set_max_streams(struct zcomp *comp, int num_strm) -{ - return true; -} - struct zcomp_strm *zcomp_strm_find(struct zcomp *comp) { return *get_cpu_ptr(comp->stream); diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index f92965c..8fcfbeb 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -304,46 +304,25 @@ static ssize_t mem_used_max_store(struct device *dev, return len; } +/* + * We switched to per-cpu streams and this attr is not needed anymore. + * However, we will keep it around for some time, because: + * a) we may revert per-cpu streams in the future + * b) it's visible to user space and we need to follow our 2 years + * retirement rule; but we already have a number of 'soon to be + * altered' attrs, so max_comp_streams need to wait for the next + * layoff cycle. 
+ */ static ssize_t max_comp_streams_show(struct device *dev, struct device_attribute *attr, char *buf) { - int val; - struct zram *zram = dev_to_zram(dev); - - down_read(&zram->init_lock); - val = zram->max_comp_streams; - up_read(&zram->init_lock); - - return scnprintf(buf, PAGE_SIZE, "%d\n", val); + return scnprintf(buf, PAGE_SIZE, "%d\n", num_online_cpus()); } static ssize_t max_comp_streams_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) { - int num; - struct zram *zram = dev_to_zram(dev); - int ret; - - ret = kstrtoint(buf, 0, &num); - if (ret < 0) - return ret; - if (num < 1) - return -EINVAL; - - down_write(&zram->init_lock); - if (init_done(zram)) { - if (!zcomp_set_max_streams(zram->comp, num)) { - pr_info("Cannot change max compression streams\n"); - ret = -EINVAL; - goto out; - } - } - - zram->max_comp_streams = num; - ret = len; -out: - up_write(&zram->init_lock); - return ret; + return len; } static ssize_t comp_algorithm_show(struct device *dev, @@ -1035,7 +1014,6 @@ static void zram_reset_device(struct zram *zram) /* Reset stats */ memset(&zram->stats, 0, sizeof(zram->stats)); zram->disksize = 0; - zram->max_comp_streams = 1; set_capacity(zram->disk, 0); part_stat_set_all(&zram->disk->part0, 0); @@ -1299,7 +1277,6 @@ static int zram_add(void) } strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor)); zram->meta = NULL; - zram->max_comp_streams = 1; pr_info("Added device: %s\n", zram->disk->disk_name); return device_id; diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 8e92339..06b1636 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -102,7 +102,6 @@ struct zram { * the number of pages zram can consume for storing compressed data */ unsigned long limit_pages; - int max_comp_streams; struct zram_stats stats; atomic_t refcount; /* refcount for zram_meta */ -- cgit v0.10.2 From 623e47fc64f8de480b322b7ed68855f97137e2a5 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Fri, 20 May 2016 17:00:02 -0700 Subject: zram: introduce per-device debug_stat sysfs node debug_stat sysfs is read-only and represents various debugging data that zram developers may need. This file is not meant to be used by anyone else: its content is not documented and will change any time w/o any notice. Therefore, the output of debug_stat file contains a version string. To avoid any confusion, we will increase the version number every time we modify the output. At the moment this file exports only one value -- the number of re-compressions, IOW, the number of times compression fast path has failed. This stat is temporary any will be useful in case if any per-cpu compression streams regressions will be reported. Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky Signed-off-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 2e69e83..4518d30 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -166,3 +166,12 @@ Description: The mm_stat file is read-only and represents device's mm statistics (orig_data_size, compr_data_size, etc.) in a format similar to block layer statistics file format. 
+ +What: /sys/block/zram/debug_stat +Date: July 2016 +Contact: Sergey Senozhatsky +Description: + The debug_stat file is read-only and represents various + device's debugging info useful for kernel developers. Its + format is not documented intentionally and may change + anytime without any notice. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index d88f0c7..13100fb 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -172,6 +172,7 @@ mem_limit RW the maximum amount of memory ZRAM can use to store pages_compacted RO the number of pages freed during compaction (available only via zram/mm_stat node) compact WO trigger memory compaction +debug_stat RO this file is used for zram debugging purposes WARNING ======= diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 8fcfbeb..8fcad8b 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -435,8 +435,26 @@ static ssize_t mm_stat_show(struct device *dev, return ret; } +static ssize_t debug_stat_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + int version = 1; + struct zram *zram = dev_to_zram(dev); + ssize_t ret; + + down_read(&zram->init_lock); + ret = scnprintf(buf, PAGE_SIZE, + "version: %d\n%8llu\n", + version, + (u64)atomic64_read(&zram->stats.writestall)); + up_read(&zram->init_lock); + + return ret; +} + static DEVICE_ATTR_RO(io_stat); static DEVICE_ATTR_RO(mm_stat); +static DEVICE_ATTR_RO(debug_stat); ZRAM_ATTR_RO(num_reads); ZRAM_ATTR_RO(num_writes); ZRAM_ATTR_RO(failed_reads); @@ -719,6 +737,8 @@ compress_again: zcomp_strm_release(zram->comp, zstrm); zstrm = NULL; + atomic64_inc(&zram->stats.writestall); + handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM); if (handle) @@ -1181,6 +1201,7 @@ static struct attribute *zram_disk_attrs[] = { &dev_attr_comp_algorithm.attr, &dev_attr_io_stat.attr, &dev_attr_mm_stat.attr, + &dev_attr_debug_stat.attr, NULL, }; diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 06b1636..3f5bf66 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -85,6 +85,7 @@ struct zram_stats { atomic64_t zero_pages; /* no. of zero filled pages */ atomic64_t pages_stored; /* no. of pages currently stored */ atomic_long_t max_used_pages; /* no. of maximum pages stored */ + atomic64_t writestall; /* no. of write slow paths */ }; struct zram_meta { -- cgit v0.10.2 From 3e42979e65dace1f9268dd5440e5ab096b8dee59 Mon Sep 17 00:00:00 2001 From: "Richard W.M. Jones" Date: Fri, 20 May 2016 17:00:05 -0700 Subject: procfs: expose umask in /proc//status It's not possible to read the process umask without also modifying it, which is what umask(2) does. A library cannot read umask safely, especially if the main program might be multithreaded. Add a new status line ("Umask") in /proc//status. It contains the file mode creation mask (umask) in octal. It is only shown for tasks which have task->fs. This patch is adapted from one originally written by Pierre Carrier. The use case is that we have endless trouble with people setting weird umask() values (usually on the grounds of "security"), and then everything breaking. I'm on the hook to fix these. We'd like to add debugging to our program so we can dump out the umask in debug reports. Previous versions of the patch used a syscall so you could only read your own umask. That's all I need. 
However there was quite a lot of push-back from those, so this new version exports it in /proc. See: https://lkml.org/lkml/2016/4/13/704 [umask2] https://lkml.org/lkml/2016/4/13/487 [getumask] Signed-off-by: Richard W.M. Jones Acked-by: Konstantin Khlebnikov Acked-by: Jerome Marchand Acked-by: Kees Cook Cc: "Theodore Ts'o" Cc: Michal Hocko Cc: Pierre Carrier Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 7f5607a..e8d0075 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1) TracerPid PID of process tracing this process (0 if not) Uid Real, effective, saved set, and file system UIDs Gid Real, effective, saved set, and file system GIDs + Umask file mode creation mask FDSize number of file descriptor slots currently allocated Groups supplementary group list NStgid descendant namespace thread group ID hierarchy diff --git a/fs/proc/array.c b/fs/proc/array.c index b6c00ce..88c7de1 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -83,6 +83,7 @@ #include #include #include +#include #include #include @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct task_struct *tsk) return task_state_array[fls(state)]; } +static inline int get_task_umask(struct task_struct *tsk) +{ + struct fs_struct *fs; + int umask = -ENOENT; + + task_lock(tsk); + fs = tsk->fs; + if (fs) + umask = fs->umask; + task_unlock(tsk); + return umask; +} + static inline void task_state(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *p) { struct user_namespace *user_ns = seq_user_ns(m); struct group_info *group_info; - int g; + int g, umask; struct task_struct *tracer; const struct cred *cred; pid_t ppid, tpid = 0, tgid, ngid; @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns, ngid = task_numa_group_id(p); cred = get_task_cred(p); + umask = get_task_umask(p); + if (umask >= 0) + seq_printf(m, "Umask:\t%#04o\n", umask); + task_lock(p); if (p->files) max_fds = files_fdtable(p->files)->max_fds; -- cgit v0.10.2 From 1b3044e39a89cb1d4d5313da477e8dfea2b5232d Mon Sep 17 00:00:00 2001 From: Janis Danisevskis Date: Fri, 20 May 2016 17:00:08 -0700 Subject: procfs: fix pthread cross-thread naming if !PR_DUMPABLE The PR_DUMPABLE flag causes the pid related paths of the proc file system to be owned by ROOT. The implementation of pthread_set/getname_np however needs access to /proc//task//comm. If PR_DUMPABLE is false this implementation is locked out. This patch installs a special permission function for the file "comm" that grants read and write access to all threads of the same group regardless of the ownership of the inode. For all other threads the function falls back to the generic inode permission check. [akpm@linux-foundation.org: fix spello in comment] Signed-off-by: Janis Danisevskis Acked-by: Kees Cook Cc: Al Viro Cc: Cyrill Gorcunov Cc: Alexey Dobriyan Cc: Colin Ian King Cc: David Rientjes Cc: Minfei Huang Cc: John Stultz Cc: Calvin Owens Cc: Jann Horn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/proc/base.c b/fs/proc/base.c index ff4527d..a11eb71 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3163,6 +3163,44 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx) } /* + * proc_tid_comm_permission is a special permission function exclusively + * used for the node /proc//task//comm. 
+ * It bypasses generic permission checks in the case where a task of the same + * task group attempts to access the node. + * The rationale behind this is that glibc and bionic access this node for + * cross thread naming (pthread_set/getname_np(!self)). However, if + * PR_SET_DUMPABLE gets set to 0 this node among others becomes uid=0 gid=0, + * which locks out the cross thread naming implementation. + * This function makes sure that the node is always accessible for members of + * same thread group. + */ +static int proc_tid_comm_permission(struct inode *inode, int mask) +{ + bool is_same_tgroup; + struct task_struct *task; + + task = get_proc_task(inode); + if (!task) + return -ESRCH; + is_same_tgroup = same_thread_group(current, task); + put_task_struct(task); + + if (likely(is_same_tgroup && !(mask & MAY_EXEC))) { + /* This file (/proc//task//comm) can always be + * read or written by the members of the corresponding + * thread group. + */ + return 0; + } + + return generic_permission(inode, mask); +} + +static const struct inode_operations proc_tid_comm_inode_operations = { + .permission = proc_tid_comm_permission, +}; + +/* * Tasks */ static const struct pid_entry tid_base_stuff[] = { @@ -3180,7 +3218,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_SCHED_DEBUG REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations), #endif - REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations), + NOD("comm", S_IFREG|S_IRUGO|S_IWUSR, + &proc_tid_comm_inode_operations, + &proc_pid_set_comm_operations, {}), #ifdef CONFIG_HAVE_ARCH_TRACEHOOK ONE("syscall", S_IRUSR, proc_pid_syscall), #endif -- cgit v0.10.2 From 2ec656eb4054ce55e6d453b8614ef9669e84c542 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:11 -0700 Subject: mn10300: let exit_fpu accept a task We need to call exit_thread from copy_process in a fail path. Since exit_thread on mn10300 calls exit_thread_runtime_instr, make it accept task_struct as a parameter now. Signed-off-by: Jiri Slaby Cc: "David S. Miller" Cc: "H. Peter Anvin" Cc: "James E.J. 
Bottomley" Cc: Aurelien Jacquiot Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Chen Liqin Cc: Chris Metcalf Cc: Chris Zankel Cc: David Howells Cc: Fenghua Yu Cc: Geert Uytterhoeven Cc: Guan Xuetao Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Heiko Carstens Cc: Helge Deller Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Hogan Cc: Jeff Dike Cc: Jesper Nilsson Cc: Jiri Slaby Cc: Jonas Bonn Cc: Koichi Yasutake Cc: Lennox Wu Cc: Ley Foon Tan Cc: Mark Salter Cc: Martin Schwidefsky Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mikael Starvik Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Richard Henderson Cc: Richard Kuo Cc: Richard Weinberger Cc: Russell King Cc: Steven Miao Cc: Thomas Gleixner Cc: Tony Luck Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/mn10300/include/asm/fpu.h b/arch/mn10300/include/asm/fpu.h index 738ff72..a47e995 100644 --- a/arch/mn10300/include/asm/fpu.h +++ b/arch/mn10300/include/asm/fpu.h @@ -76,11 +76,9 @@ static inline void unlazy_fpu(struct task_struct *tsk) preempt_enable(); } -static inline void exit_fpu(void) +static inline void exit_fpu(struct task_struct *tsk) { #ifdef CONFIG_LAZY_SAVE_FPU - struct task_struct *tsk = current; - preempt_disable(); if (fpu_state_owner == tsk) fpu_state_owner = NULL; @@ -123,7 +121,7 @@ static inline void fpu_init_state(void) {} static inline void fpu_save(struct fpu_state_struct *s) {} static inline void fpu_kill_state(struct task_struct *tsk) {} static inline void unlazy_fpu(struct task_struct *tsk) {} -static inline void exit_fpu(void) {} +static inline void exit_fpu(struct task_struct *tsk) {} static inline void flush_fpu(void) {} static inline int fpu_setup_sigcontext(struct fpucontext *buf) { return 0; } static inline int fpu_restore_sigcontext(struct fpucontext *buf) { return 0; } diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c index 3707da5..74a96cc 100644 --- a/arch/mn10300/kernel/process.c +++ b/arch/mn10300/kernel/process.c @@ -105,7 +105,7 @@ void show_regs(struct pt_regs *regs) */ void exit_thread(void) { - exit_fpu(); + exit_fpu(current); } void flush_thread(void) -- cgit v0.10.2 From 5f56a5dfdb9bcb3bca03df59980d4d2f012cbb53 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:16 -0700 Subject: exit_thread: remove empty bodies Define HAVE_EXIT_THREAD for archs which want to do something in exit_thread. For others, let's define exit_thread as an empty inline. This is a cleanup before we change the prototype of exit_thread to accept a task parameter. [akpm@linux-foundation.org: fix mips] Signed-off-by: Jiri Slaby Cc: "David S. Miller" Cc: "H. Peter Anvin" Cc: "James E.J. 
Bottomley" Cc: Aurelien Jacquiot Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Chen Liqin Cc: Chris Metcalf Cc: Chris Zankel Cc: David Howells Cc: Fenghua Yu Cc: Geert Uytterhoeven Cc: Guan Xuetao Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Heiko Carstens Cc: Helge Deller Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Hogan Cc: Jeff Dike Cc: Jesper Nilsson Cc: Jiri Slaby Cc: Jonas Bonn Cc: Koichi Yasutake Cc: Lennox Wu Cc: Ley Foon Tan Cc: Mark Salter Cc: Martin Schwidefsky Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mikael Starvik Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Richard Henderson Cc: Richard Kuo Cc: Richard Weinberger Cc: Russell King Cc: Steven Miao Cc: Thomas Gleixner Cc: Tony Luck Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/Kconfig b/arch/Kconfig index 81869a5..0f298f9 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -517,6 +517,11 @@ config HAVE_ARCH_MMAP_RND_BITS - ARCH_MMAP_RND_BITS_MIN - ARCH_MMAP_RND_BITS_MAX +config HAVE_EXIT_THREAD + bool + help + An architecture implements exit_thread. + config ARCH_MMAP_RND_BITS_MIN int diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c index 84d1326..b483156 100644 --- a/arch/alpha/kernel/process.c +++ b/arch/alpha/kernel/process.c @@ -210,14 +210,6 @@ start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp) } EXPORT_SYMBOL(start_thread); -/* - * Free current thread data structures etc.. - */ -void -exit_thread(void) -{ -} - void flush_thread(void) { diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index a3f750e..b5db9e7 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -183,13 +183,6 @@ void flush_thread(void) { } -/* - * Free any architecture-specific thread data structures, etc. - */ -void exit_thread(void) -{ -} - int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpu) { return 0; diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index b99d25b..956d3575 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -50,6 +50,7 @@ config ARM select HAVE_DMA_CONTIGUOUS if MMU select HAVE_DYNAMIC_FTRACE if (!XIP_KERNEL) && !CPU_ENDIAN_BE32 && MMU select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) && MMU + select HAVE_EXIT_THREAD select HAVE_FTRACE_MCOUNT_RECORD if (!XIP_KERNEL) select HAVE_FUNCTION_GRAPH_TRACER if (!THUMB2_KERNEL) select HAVE_FUNCTION_TRACER if (!XIP_KERNEL) diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 48eea68..6cd2612 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -200,13 +200,6 @@ void show_regs(struct pt_regs * regs) __show_regs(regs); } -/* - * Free current thread data structures etc.. - */ -void exit_thread(void) -{ -} - static void tls_thread_flush(void) { asm ("msr tpidr_el0, xzr"); diff --git a/arch/avr32/Kconfig b/arch/avr32/Kconfig index 18b8877..e43519a 100644 --- a/arch/avr32/Kconfig +++ b/arch/avr32/Kconfig @@ -4,6 +4,7 @@ config AVR32 # that we usually don't need on AVR32. 
select EXPERT select HAVE_CLK + select HAVE_EXIT_THREAD select HAVE_OPROFILE select HAVE_KPROBES select VIRT_TO_BUS diff --git a/arch/blackfin/include/asm/processor.h b/arch/blackfin/include/asm/processor.h index 7acd466..0c265ab 100644 --- a/arch/blackfin/include/asm/processor.h +++ b/arch/blackfin/include/asm/processor.h @@ -76,13 +76,6 @@ static inline void release_thread(struct task_struct *dead_task) } /* - * Free current thread data structures etc.. - */ -static inline void exit_thread(void) -{ -} - -/* * Return saved PC of a blocked thread. */ #define thread_saved_pc(tsk) (tsk->thread.pc) diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c index 3ae9f5a..0ee7686 100644 --- a/arch/c6x/kernel/process.c +++ b/arch/c6x/kernel/process.c @@ -82,10 +82,6 @@ void flush_thread(void) { } -void exit_thread(void) -{ -} - /* * Do necessary setup to start up a newly executed thread. */ diff --git a/arch/cris/Kconfig b/arch/cris/Kconfig index 99bda1b..5c0ca8a 100644 --- a/arch/cris/Kconfig +++ b/arch/cris/Kconfig @@ -59,6 +59,7 @@ config CRIS select GENERIC_IOMAP select MODULES_USE_ELF_RELA select CLONE_BACKWARDS2 + select HAVE_EXIT_THREAD if ETRAX_ARCH_V32 select OLD_SIGSUSPEND select OLD_SIGACTION select GPIOLIB diff --git a/arch/cris/arch-v10/kernel/process.c b/arch/cris/arch-v10/kernel/process.c index 02b7834..96e5afe 100644 --- a/arch/cris/arch-v10/kernel/process.c +++ b/arch/cris/arch-v10/kernel/process.c @@ -35,15 +35,6 @@ void default_idle(void) local_irq_enable(); } -/* - * Free current thread data structures etc.. - */ - -void exit_thread(void) -{ - /* Nothing needs to be done. */ -} - /* if the watchdog is enabled, we can simply disable interrupts and go * into an eternal loop, and the watchdog will reset the CPU after 0.1s * if on the other hand the watchdog wasn't enabled, we just enable it and wait diff --git a/arch/frv/include/asm/processor.h b/arch/frv/include/asm/processor.h index ae8d423..73f0a79 100644 --- a/arch/frv/include/asm/processor.h +++ b/arch/frv/include/asm/processor.h @@ -97,13 +97,6 @@ extern asmlinkage void *restore_user_regs(const struct user_context *target, ... #define forget_segments() do { } while (0) /* - * Free current thread data structures etc.. - */ -static inline void exit_thread(void) -{ -} - -/* * Return saved PC of a blocked thread. */ extern unsigned long thread_saved_pc(struct task_struct *tsk); diff --git a/arch/h8300/include/asm/processor.h b/arch/h8300/include/asm/processor.h index 54e3fd8..111df73 100644 --- a/arch/h8300/include/asm/processor.h +++ b/arch/h8300/include/asm/processor.h @@ -111,13 +111,6 @@ static inline void release_thread(struct task_struct *dead_task) } /* - * Free current thread data structures etc.. - */ -static inline void exit_thread(void) -{ -} - -/* * Return saved PC of a blocked thread. */ unsigned long thread_saved_pc(struct task_struct *tsk); diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c index a9ebd47..d9edfd3 100644 --- a/arch/hexagon/kernel/process.c +++ b/arch/hexagon/kernel/process.c @@ -137,13 +137,6 @@ void release_thread(struct task_struct *dead_task) } /* - * Free any architecture-specific thread data structures, etc. 
- */ -void exit_thread(void) -{ -} - -/* * Some archs flush debug and FPU info here */ void flush_thread(void) diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index b534eba..f80758c 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -18,6 +18,7 @@ config IA64 select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select HAVE_UNSTABLE_SCHED_CLOCK + select HAVE_EXIT_THREAD select HAVE_IDE select HAVE_OPROFILE select HAVE_KPROBES diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c index e69221d..a88b1f0 100644 --- a/arch/m32r/kernel/process.c +++ b/arch/m32r/kernel/process.c @@ -101,15 +101,6 @@ void show_regs(struct pt_regs * regs) #endif } -/* - * Free current thread data structures etc.. - */ -void exit_thread(void) -{ - /* Nothing to do. */ - DPRINTK("pid = %d\n", current->pid); -} - void flush_thread(void) { DPRINTK("pid = %d\n", current->pid); diff --git a/arch/m68k/include/asm/processor.h b/arch/m68k/include/asm/processor.h index 20dda1d..a6ce2ec 100644 --- a/arch/m68k/include/asm/processor.h +++ b/arch/m68k/include/asm/processor.h @@ -153,13 +153,6 @@ static inline void release_thread(struct task_struct *dead_task) { } -/* - * Free current thread data structures etc.. - */ -static inline void exit_thread(void) -{ -} - extern unsigned long thread_saved_pc(struct task_struct *tsk); unsigned long get_wchan(struct task_struct *p); diff --git a/arch/metag/Kconfig b/arch/metag/Kconfig index a0fa88d..e47a08d 100644 --- a/arch/metag/Kconfig +++ b/arch/metag/Kconfig @@ -11,6 +11,7 @@ config METAG select HAVE_DEBUG_KMEMLEAK select HAVE_DEBUG_STACKOVERFLOW select HAVE_DYNAMIC_FTRACE + select HAVE_EXIT_THREAD select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_TRACER select HAVE_KERNEL_BZIP2 diff --git a/arch/metag/include/asm/processor.h b/arch/metag/include/asm/processor.h index 0838ca6..a0333eb 100644 --- a/arch/metag/include/asm/processor.h +++ b/arch/metag/include/asm/processor.h @@ -134,8 +134,6 @@ static inline void release_thread(struct task_struct *dead_task) #define copy_segments(tsk, mm) do { } while (0) #define release_segments(mm) do { } while (0) -extern void exit_thread(void); - /* * Return saved PC of a blocked thread. */ diff --git a/arch/microblaze/include/asm/processor.h b/arch/microblaze/include/asm/processor.h index 497a988..c38d0dd 100644 --- a/arch/microblaze/include/asm/processor.h +++ b/arch/microblaze/include/asm/processor.h @@ -70,11 +70,6 @@ static inline void release_thread(struct task_struct *dead_task) { } -/* Free all resources held by a thread. */ -static inline void exit_thread(void) -{ -} - extern unsigned long thread_saved_pc(struct task_struct *t); extern unsigned long get_wchan(struct task_struct *p); @@ -127,11 +122,6 @@ static inline void release_thread(struct task_struct *dead_task) { } -/* Free current thread data structures etc. */ -static inline void exit_thread(void) -{ -} - /* Return saved (kernel) PC of a blocked thread. */ # define thread_saved_pc(tsk) \ ((tsk)->thread.regs ? 
(tsk)->thread.regs->r15 : 0) diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c index a6b3dc5..411c971 100644 --- a/arch/mips/kernel/process.c +++ b/arch/mips/kernel/process.c @@ -73,10 +73,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp) regs->regs[29] = sp; } -void exit_thread(void) -{ -} - int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) { /* diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig index 06ddb55..9627e81 100644 --- a/arch/mn10300/Kconfig +++ b/arch/mn10300/Kconfig @@ -1,5 +1,6 @@ config MN10300 def_bool y + select HAVE_EXIT_THREAD select HAVE_OPROFILE select HAVE_UID16 select GENERIC_IRQ_SHOW diff --git a/arch/nios2/include/asm/processor.h b/arch/nios2/include/asm/processor.h index c2ba45c..1c953f0 100644 --- a/arch/nios2/include/asm/processor.h +++ b/arch/nios2/include/asm/processor.h @@ -75,11 +75,6 @@ static inline void release_thread(struct task_struct *dead_task) { } -/* Free current thread data structures etc.. */ -static inline void exit_thread(void) -{ -} - /* Return saved PC of a blocked thread. */ #define thread_saved_pc(tsk) ((tsk)->thread.kregs->ea) diff --git a/arch/openrisc/include/asm/processor.h b/arch/openrisc/include/asm/processor.h index 4d235e3..70334c9 100644 --- a/arch/openrisc/include/asm/processor.h +++ b/arch/openrisc/include/asm/processor.h @@ -85,15 +85,6 @@ void release_thread(struct task_struct *); unsigned long get_wchan(struct task_struct *p); /* - * Free current thread data structures etc.. - */ - -extern inline void exit_thread(void) -{ - /* Nothing needs to be done. */ -} - -/* * Return saved PC of a blocked thread. For now, this is the "user" PC */ extern unsigned long thread_saved_pc(struct task_struct *t); diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c index 809905a..4063943 100644 --- a/arch/parisc/kernel/process.c +++ b/arch/parisc/kernel/process.c @@ -144,13 +144,6 @@ void machine_power_off(void) void (*pm_power_off)(void) = machine_power_off; EXPORT_SYMBOL(pm_power_off); -/* - * Free current thread data structures etc.. - */ -void exit_thread(void) -{ -} - void flush_thread(void) { /* Only needs to handle fpu stuff or perf monitors. diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index ea8a28f..e2f12cb 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1329,10 +1329,6 @@ void show_regs(struct pt_regs * regs) show_instructions(regs); } -void exit_thread(void) -{ -} - void flush_thread(void) { #ifdef CONFIG_HAVE_HW_BREAKPOINT diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index de0fcc0..e2c9aaa 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -134,6 +134,7 @@ config S390 select HAVE_DMA_API_DEBUG select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS + select HAVE_EXIT_THREAD select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER diff --git a/arch/score/kernel/process.c b/arch/score/kernel/process.c index a1519ad3..aae9480 100644 --- a/arch/score/kernel/process.c +++ b/arch/score/kernel/process.c @@ -56,8 +56,6 @@ void start_thread(struct pt_regs *regs, unsigned long pc, unsigned long sp) regs->regs[0] = sp; } -void exit_thread(void) {} - /* * When a process does an "exec", machine state like FPU and debug * registers need to be reset. This is a hook function for that. 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index 7ed20fc..cb93af8 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -71,6 +71,7 @@ config SUPERH32 config SUPERH64 def_bool ARCH = "sh64" + select HAVE_EXIT_THREAD select KALLSYMS config ARCH_DEFCONFIG diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c index 2885fc9..ee12e94 100644 --- a/arch/sh/kernel/process_32.c +++ b/arch/sh/kernel/process_32.c @@ -76,13 +76,6 @@ void start_thread(struct pt_regs *regs, unsigned long new_pc, } EXPORT_SYMBOL(start_thread); -/* - * Free current thread data structures etc.. - */ -void exit_thread(void) -{ -} - void flush_thread(void) { struct task_struct *tsk = current; diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index db0a26c..27b3a0a 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -20,6 +20,7 @@ config SPARC select HAVE_OPROFILE select HAVE_ARCH_KGDB if !SMP || SPARC64 select HAVE_ARCH_TRACEHOOK + select HAVE_EXIT_THREAD select SYSCTL_EXCEPTION_TRACE select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE select RTC_CLASS diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index 8171930..1747462 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -3,6 +3,7 @@ config TILE def_bool y + select HAVE_EXIT_THREAD select HAVE_PERF_EVENTS select USE_PMC if PERF_EVENTS select HAVE_DMA_API_DEBUG diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c index 48af59a..0b04711 100644 --- a/arch/um/kernel/process.c +++ b/arch/um/kernel/process.c @@ -103,10 +103,6 @@ void interrupt_end(void) tracehook_notify_resume(regs); } -void exit_thread(void) -{ -} - int get_current_pid(void) { return task_pid_nr(current); diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c index b008e99..00299c9 100644 --- a/arch/unicore32/kernel/process.c +++ b/arch/unicore32/kernel/process.c @@ -201,13 +201,6 @@ void show_regs(struct pt_regs *regs) __backtrace(); } -/* - * Free current thread data structures etc.. 
- */ -void exit_thread(void) -{ -} - void flush_thread(void) { struct thread_info *thread = current_thread_info(); diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ace79d2..8ff5b3b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -105,6 +105,7 @@ config X86 select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS select HAVE_EFFICIENT_UNALIGNED_ACCESS + select HAVE_EXIT_THREAD select HAVE_FENTRY if X86_64 select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_FP_TEST diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index 85257af..64336f6 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -14,6 +14,7 @@ config XTENSA select GENERIC_PCI_IOMAP select GENERIC_SCHED_CLOCK select HAVE_DMA_API_DEBUG + select HAVE_EXIT_THREAD select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if !MMU select HAVE_HW_BREAKPOINT if PERF_EVENTS diff --git a/include/linux/sched.h b/include/linux/sched.h index 6b3213d..167c0d4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2769,7 +2769,14 @@ static inline int copy_thread_tls( } #endif extern void flush_thread(void); + +#ifdef CONFIG_HAVE_EXIT_THREAD extern void exit_thread(void); +#else +static inline void exit_thread(void) +{ +} +#endif extern void exit_files(struct task_struct *); extern void __cleanup_sighand(struct sighand_struct *); -- cgit v0.10.2 From e64646946ed32902fd597fa6e514b1da84642de3 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:20 -0700 Subject: exit_thread: accept a task parameter to be exited We need to call exit_thread from copy_process in a fail path. So make it accept task_struct as a parameter. [v2] * s390: exit_thread_runtime_instr doesn't make sense to be called for non-current tasks. * arm: fix the comment in vfp_thread_copy * change 'me' to 'tsk' for task_struct * now we can change only archs that actually have exit_thread [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jiri Slaby Cc: "David S. Miller" Cc: "H. Peter Anvin" Cc: "James E.J. Bottomley" Cc: Aurelien Jacquiot Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Chen Liqin Cc: Chris Metcalf Cc: Chris Zankel Cc: David Howells Cc: Fenghua Yu Cc: Geert Uytterhoeven Cc: Guan Xuetao Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Heiko Carstens Cc: Helge Deller Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Hogan Cc: Jeff Dike Cc: Jesper Nilsson Cc: Jiri Slaby Cc: Jonas Bonn Cc: Koichi Yasutake Cc: Lennox Wu Cc: Ley Foon Tan Cc: Mark Salter Cc: Martin Schwidefsky Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mikael Starvik Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Richard Henderson Cc: Richard Kuo Cc: Richard Weinberger Cc: Russell King Cc: Steven Miao Cc: Thomas Gleixner Cc: Tony Luck Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c index 4adfb46..a647d66 100644 --- a/arch/arm/kernel/process.c +++ b/arch/arm/kernel/process.c @@ -193,9 +193,9 @@ EXPORT_SYMBOL_GPL(thread_notify_head); /* * Free current thread data structures etc.. 
*/ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - thread_notify(THREAD_NOTIFY_EXIT, current_thread_info()); + thread_notify(THREAD_NOTIFY_EXIT, task_thread_info(tsk)); } void flush_thread(void) diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c index 2a61e4b..73085d3 100644 --- a/arch/arm/vfp/vfpmodule.c +++ b/arch/arm/vfp/vfpmodule.c @@ -156,10 +156,6 @@ static void vfp_thread_copy(struct thread_info *thread) * - we could be preempted if tree preempt rcu is enabled, so * it is unsafe to use thread->cpu. * THREAD_NOTIFY_EXIT - * - the thread (v) will be running on the local CPU, so - * v === current_thread_info() - * - thread->cpu is the local CPU number at the time it is accessed, - * but may change at any time. * - we could be preempted if tree preempt rcu is enabled, so * it is unsafe to use thread->cpu. */ diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c index 42a53e74..68e5b9d 100644 --- a/arch/avr32/kernel/process.c +++ b/arch/avr32/kernel/process.c @@ -62,9 +62,9 @@ void machine_restart(char *cmd) /* * Free current thread data structures etc */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - ocd_disable(current); + ocd_disable(tsk); } void flush_thread(void) diff --git a/arch/cris/arch-v32/kernel/process.c b/arch/cris/arch-v32/kernel/process.c index c7ce784..4d1afa9 100644 --- a/arch/cris/arch-v32/kernel/process.c +++ b/arch/cris/arch-v32/kernel/process.c @@ -33,9 +33,9 @@ void default_idle(void) */ extern void deconfigure_bp(long pid); -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - deconfigure_bp(current->pid); + deconfigure_bp(tsk->pid); } /* diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index 9cd607b..2436ad5 100644 --- a/arch/ia64/kernel/perfmon.c +++ b/arch/ia64/kernel/perfmon.c @@ -4542,8 +4542,8 @@ pfm_context_unload(pfm_context_t *ctx, void *arg, int count, struct pt_regs *reg /* - * called only from exit_thread(): task == current - * we come here only if current has a context attached (loaded or masked) + * called only from exit_thread() + * we come here only if the task has a context attached (loaded or masked) */ void pfm_exit_thread(struct task_struct *task) diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c index b515149..aae6c4d 100644 --- a/arch/ia64/kernel/process.c +++ b/arch/ia64/kernel/process.c @@ -570,22 +570,22 @@ flush_thread (void) } /* - * Clean up state associated with current thread. This is called when + * Clean up state associated with a thread. This is called when * the thread calls exit(). */ void -exit_thread (void) +exit_thread (struct task_struct *tsk) { - ia64_drop_fpu(current); + ia64_drop_fpu(tsk); #ifdef CONFIG_PERFMON /* if needed, stop monitoring and flush state to perfmon context */ - if (current->thread.pfm_context) - pfm_exit_thread(current); + if (tsk->thread.pfm_context) + pfm_exit_thread(tsk); /* free debug register resources */ - if (current->thread.flags & IA64_THREAD_DBG_VALID) - pfm_release_debug_registers(current); + if (tsk->thread.flags & IA64_THREAD_DBG_VALID) + pfm_release_debug_registers(tsk); #endif } diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c index 7f54618..3506279 100644 --- a/arch/metag/kernel/process.c +++ b/arch/metag/kernel/process.c @@ -345,10 +345,10 @@ void flush_thread(void) /* * Free current thread data structures etc. 
*/ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - clear_fpu(¤t->thread); - clear_dsp(¤t->thread); + clear_fpu(&tsk->thread); + clear_dsp(&tsk->thread); } /* TODO: figure out how to unwind the kernel stack here to figure out diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c index 74a96cc..cbede4e 100644 --- a/arch/mn10300/kernel/process.c +++ b/arch/mn10300/kernel/process.c @@ -103,9 +103,9 @@ void show_regs(struct pt_regs *regs) /* * free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - exit_fpu(current); + exit_fpu(tsk); } void flush_thread(void) diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c index 481d7a83..bba4fa7 100644 --- a/arch/s390/kernel/process.c +++ b/arch/s390/kernel/process.c @@ -68,9 +68,10 @@ extern void kernel_thread_starter(void); /* * Free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - exit_thread_runtime_instr(); + if (tsk == current) + exit_thread_runtime_instr(); } void flush_thread(void) diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c index e2062e6..9d3e991 100644 --- a/arch/sh/kernel/process_64.c +++ b/arch/sh/kernel/process_64.c @@ -288,7 +288,7 @@ void show_regs(struct pt_regs *regs) /* * Free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { /* * See arch/sparc/kernel/process.c for the precedent for doing @@ -307,9 +307,8 @@ void exit_thread(void) * which it would get safely nulled. */ #ifdef CONFIG_SH_FPU - if (last_task_used_math == current) { + if (last_task_used_math == tsk) last_task_used_math = NULL; - } #endif } diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c index c5113c7..b7780a5 100644 --- a/arch/sparc/kernel/process_32.c +++ b/arch/sparc/kernel/process_32.c @@ -184,21 +184,21 @@ unsigned long thread_saved_pc(struct task_struct *tsk) /* * Free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { #ifndef CONFIG_SMP - if(last_task_used_math == current) { + if (last_task_used_math == tsk) { #else - if (test_thread_flag(TIF_USEDFPU)) { + if (test_ti_thread_flag(task_thread_info(tsk), TIF_USEDFPU)) { #endif /* Keep process from leaving FPU in a bogon state. */ put_psr(get_psr() | PSR_EF); - fpsave(¤t->thread.float_regs[0], ¤t->thread.fsr, - ¤t->thread.fpqueue[0], ¤t->thread.fpqdepth); + fpsave(&tsk->thread.float_regs[0], &tsk->thread.fsr, + &tsk->thread.fpqueue[0], &tsk->thread.fpqdepth); #ifndef CONFIG_SMP last_task_used_math = NULL; #else - clear_thread_flag(TIF_USEDFPU); + clear_ti_thread_flag(task_thread_info(tsk), TIF_USEDFPU); #endif } } diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c index c16ef1a..fa14402 100644 --- a/arch/sparc/kernel/process_64.c +++ b/arch/sparc/kernel/process_64.c @@ -417,9 +417,9 @@ unsigned long thread_saved_pc(struct task_struct *tsk) } /* Free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - struct thread_info *t = current_thread_info(); + struct thread_info *t = task_thread_info(tsk); if (t->utraps) { if (t->utraps[0] < 2) diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c index b5f30d3..6b705cc 100644 --- a/arch/tile/kernel/process.c +++ b/arch/tile/kernel/process.c @@ -541,7 +541,7 @@ void flush_thread(void) /* * Free current thread data structures etc.. 
*/ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { #ifdef CONFIG_HARDWALL /* @@ -550,7 +550,7 @@ void exit_thread(void) * the last reference to a hardwall fd, it would already have * been released and deactivated at this point.) */ - hardwall_deactivate_all(current); + hardwall_deactivate_all(tsk); #endif } diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 2915d54..96becbb 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -97,10 +97,9 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) /* * Free current thread data structures etc.. */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { - struct task_struct *me = current; - struct thread_struct *t = &me->thread; + struct thread_struct *t = &tsk->thread; unsigned long *bp = t->io_bitmap_ptr; struct fpu *fpu = &t->fpu; diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c index 5bbfed8..e0ded48 100644 --- a/arch/xtensa/kernel/process.c +++ b/arch/xtensa/kernel/process.c @@ -115,10 +115,10 @@ void arch_cpu_idle(void) /* * This is called when the thread calls exit(). */ -void exit_thread(void) +void exit_thread(struct task_struct *tsk) { #if XTENSA_HAVE_COPROCESSORS - coprocessor_release_all(current_thread_info()); + coprocessor_release_all(task_thread_info(tsk)); #endif } diff --git a/include/linux/sched.h b/include/linux/sched.h index 167c0d4..02bdab4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2771,9 +2771,9 @@ static inline int copy_thread_tls( extern void flush_thread(void); #ifdef CONFIG_HAVE_EXIT_THREAD -extern void exit_thread(void); +extern void exit_thread(struct task_struct *tsk); #else -static inline void exit_thread(void) +static inline void exit_thread(struct task_struct *tsk) { } #endif diff --git a/kernel/exit.c b/kernel/exit.c index fd90195..75b34fe 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -746,7 +746,7 @@ void do_exit(long code) disassociate_ctty(1); exit_task_namespaces(tsk); exit_task_work(tsk); - exit_thread(); + exit_thread(tsk); /* * Flush inherited counters to the parent - before the parent -- cgit v0.10.2 From 0740aa5f6375681c57488c4ea55d05a0341cfc9c Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:25 -0700 Subject: fork: free thread in copy_process on failure When using this program (as root): #include #include #include #include #include #include #include #define ITER 1000 #define FORKERS 15 #define THREADS (6000/FORKERS) // 1850 is proc max static void fork_100_wait() { unsigned a, to_wait = 0; printf("\t%d forking %d\n", THREADS, getpid()); for (a = 0; a < THREADS; a++) { switch (fork()) { case 0: usleep(1000); exit(0); break; case -1: break; default: to_wait++; break; } } printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(), to_wait); for (a = 0; a < to_wait; a++) wait(NULL); printf("\t%d waited from %d\n", THREADS, getpid()); } static void run_forkers() { pid_t forkers[FORKERS]; unsigned a; for (a = 0; a < FORKERS; a++) { switch ((forkers[a] = fork())) { case 0: fork_100_wait(); exit(0); break; case -1: err(1, "DIE fork of %d'th forker", a); break; default: break; } } for (a = 0; a < FORKERS; a++) waitpid(forkers[a], NULL, 0); } int main() { unsigned a; int ret; ret = ioperm(10, 20, 0); if (ret < 0) err(1, "ioperm"); for (a = 0; a < ITER; a++) run_forkers(); return 0; } kmemleak reports many occurences of this leak: unreferenced object 0xffff8805917c8000 (size 8192): comm "fork-leak", pid 2932, jiffies 4295354292 (age 
1871.028s) hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
backtrace: [] kmemdup+0x25/0x50 [] copy_thread_tls+0x6c3/0x9a0 [] copy_process+0x1a84/0x5790 [] wake_up_new_task+0x2d5/0x6f0 [] _do_fork+0x12d/0x820 ...
The leak is caused by memory items that should have been freed in arch/x86/kernel/process.c:exit_thread(). Make sure the memory is freed when fork fails later in copy_process. This is done by calling exit_thread with the thread to kill.
Signed-off-by: Jiri Slaby Cc: "David S. Miller" Cc: "H. Peter Anvin" Cc: "James E.J. Bottomley" Cc: Aurelien Jacquiot Cc: Benjamin Herrenschmidt Cc: Catalin Marinas Cc: Chen Liqin Cc: Chris Metcalf Cc: Chris Zankel Cc: David Howells Cc: Fenghua Yu Cc: Geert Uytterhoeven Cc: Guan Xuetao Cc: Haavard Skinnemoen Cc: Hans-Christian Egtvedt Cc: Heiko Carstens Cc: Helge Deller Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Hogan Cc: Jeff Dike Cc: Jesper Nilsson Cc: Jiri Slaby Cc: Jonas Bonn Cc: Koichi Yasutake Cc: Lennox Wu Cc: Ley Foon Tan Cc: Mark Salter Cc: Martin Schwidefsky Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Mikael Starvik Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Richard Henderson Cc: Richard Kuo Cc: Richard Weinberger Cc: Russell King Cc: Steven Miao Cc: Thomas Gleixner Cc: Tony Luck Cc: Vineet Gupta Cc: Will Deacon Cc: Yoshinori Sato Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds
diff --git a/kernel/fork.c b/kernel/fork.c index 8fbed71..103d78f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1490,7 +1490,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, pid = alloc_pid(p->nsproxy->pid_ns_for_children); if (IS_ERR(pid)) { retval = PTR_ERR(pid); - goto bad_fork_cleanup_io; + goto bad_fork_cleanup_thread; } } @@ -1652,6 +1652,8 @@ bad_fork_cancel_cgroup: bad_fork_free_pid: if (pid != &init_struct_pid) free_pid(pid); +bad_fork_cleanup_thread: + exit_thread(p); bad_fork_cleanup_io: if (p->io_context) exit_io_context(p); -- cgit v0.10.2
From 2eeed7e98d6a1341b1574893a95ce5b8379140f2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ren=C3=A9=20Nyffenegger?= Date: Fri, 20 May 2016 17:00:30 -0700 Subject: include/linux/syscalls.h: use pid_t instead of int MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
In include/linux/syscalls.h, the four functions sys_kill, sys_tgkill, sys_tkill and sys_rt_sigqueueinfo are declared with "int pid" and "int tgid". However, in kernel/signal.c, the corresponding definitions use the more appropriate "pid_t" (which is a typedef'd int). This patch changes "int" to "pid_t" in the declarations of sys_kill, sys_tgkill, sys_tkill and sys_rt_sigqueueinfo in <linux/syscalls.h> in order to harmonize the function declarations with their respective definitions.
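For reference, the pattern being harmonized looks roughly like this; the definition body is elided and only sketched in a comment, the real implementation lives in kernel/signal.c:

	/* declaration in <linux/syscalls.h>, after this patch */
	asmlinkage long sys_kill(pid_t pid, int sig);

	/* matching definition in kernel/signal.c, unchanged by this patch */
	SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
	{
		/* real body builds a siginfo and delivers the signal */
		return 0;
	}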
Link: http://lkml.kernel.org/r/57302FDA.7020205@renenyffenegger.ch Signed-off-by: René Nyffenegger Cc: Andy Lutomirski Cc: Josh Triplett Cc: Al Viro Cc: "Steven Rostedt (Red Hat)" Cc: Zach Brown Cc: Milosz Tanski Cc: Arnd Bergmann Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index d795472..d022390 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -371,10 +371,10 @@ asmlinkage long sys_rt_sigtimedwait(const sigset_t __user *uthese, size_t sigsetsize); asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t pid, int sig, siginfo_t __user *uinfo); -asmlinkage long sys_kill(int pid, int sig); -asmlinkage long sys_tgkill(int tgid, int pid, int sig); -asmlinkage long sys_tkill(int pid, int sig); -asmlinkage long sys_rt_sigqueueinfo(int pid, int sig, siginfo_t __user *uinfo); +asmlinkage long sys_kill(pid_t pid, int sig); +asmlinkage long sys_tgkill(pid_t tgid, pid_t pid, int sig); +asmlinkage long sys_tkill(pid_t pid, int sig); +asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo); asmlinkage long sys_sgetmask(void); asmlinkage long sys_ssetmask(int newmask); asmlinkage long sys_signal(int sig, __sighandler_t handler); -- cgit v0.10.2 From 42a0bb3f71383b457a7db362f1c69e7afb96732b Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 20 May 2016 17:00:33 -0700 Subject: printk/nmi: generic solution for safe printk in NMI printk() takes some locks and could not be used a safe way in NMI context. The chance of a deadlock is real especially when printing stacks from all CPUs. This particular problem has been addressed on x86 by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs"). The patchset brings two big advantages. First, it makes the NMI backtraces safe on all architectures for free. Second, it makes all NMI messages almost safe on all architectures (the temporary buffer is limited. We still should keep the number of messages in NMI context at minimum). Note that there already are several messages printed in NMI context: WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE handlers. These are not easy to avoid. This patch reuses most of the code and makes it generic. It is useful for all messages and architectures that support NMI. The alternative printk_func is set when entering and is reseted when leaving NMI context. It queues IRQ work to copy the messages into the main ring buffer in a safe context. __printk_nmi_flush() copies all available messages and reset the buffer. Then we could use a simple cmpxchg operations to get synchronized with writers. There is also used a spinlock to get synchronized with other flushers. We do not longer use seq_buf because it depends on external lock. It would be hard to make all supported operations safe for a lockless use. It would be confusing and error prone to make only some operations safe. The code is put into separate printk/nmi.c as suggested by Steven Rostedt. It needs a per-CPU buffer and is compiled only on architectures that call nmi_enter(). This is achieved by the new HAVE_NMI Kconfig flag. The are MN10300 and Xtensa architectures. We need to clean up NMI handling there first. Let's do it separately. 
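Condensed from the diffs below, the redirection itself boils down to a per-CPU function pointer that NMI entry and exit swap; the buffering and flushing details are omitted in this sketch:

	/* default printk path; vprintk_nmi only appends to a per-CPU buffer */
	DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;

	void printk_nmi_enter(void)
	{
		this_cpu_write(printk_func, vprintk_nmi);
	}

	void printk_nmi_exit(void)
	{
		this_cpu_write(printk_func, vprintk_default);
	}

	asmlinkage __visible int printk(const char *fmt, ...)
	{
		va_list args;
		int r;

		va_start(args, fmt);
		/* resolves to the NMI-safe variant while in NMI context */
		r = this_cpu_read(printk_func)(fmt, args);
		va_end(args);

		return r;
	}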
The patch is heavily based on the draft from Peter Zijlstra, see https://lkml.org/lkml/2015/6/10/327 [arnd@arndb.de: printk-nmi: use %zu format string for size_t] [akpm@linux-foundation.org: min_t->min - all types are size_t here] Signed-off-by: Petr Mladek Suggested-by: Peter Zijlstra Suggested-by: Steven Rostedt Cc: Jan Kara Acked-by: Russell King [arm part] Cc: Daniel Thompson Cc: Jiri Kosina Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Ralf Baechle Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: David Miller Cc: Daniel Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/Kconfig b/arch/Kconfig index 0f298f9..8f84fd2 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -187,7 +187,11 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_NMI + bool + config HAVE_NMI_WATCHDOG + depends on HAVE_NMI bool # # An arch should select this if it provides all these things: diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 956d3575..90542db 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_KRETPROBES if (HAVE_KPROBES) select HAVE_MEMBLOCK select HAVE_MOD_ARCH_SPECIFIC + select HAVE_NMI select HAVE_OPROFILE if (HAVE_PERF_EVENTS) select HAVE_OPTPROBES if !THUMB2_KERNEL select HAVE_PERF_EVENTS diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c index baee702..df90bc5 100644 --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -644,9 +644,11 @@ void handle_IPI(int ipinr, struct pt_regs *regs) break; case IPI_CPU_BACKTRACE: + printk_nmi_enter(); irq_enter(); nmi_cpu_backtrace(regs); irq_exit(); + printk_nmi_exit(); break; default: diff --git a/arch/avr32/Kconfig b/arch/avr32/Kconfig index e43519a..7e75d45 100644 --- a/arch/avr32/Kconfig +++ b/arch/avr32/Kconfig @@ -18,6 +18,7 @@ config AVR32 select GENERIC_CLOCKEVENTS select HAVE_MOD_ARCH_SPECIFIC select MODULES_USE_ELF_RELA + select HAVE_NMI help AVR32 is a high-performance 32-bit RISC microprocessor core, designed for cost-sensitive embedded applications, with particular diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index a63c122..28c63fe 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -40,6 +40,7 @@ config BLACKFIN select HAVE_MOD_ARCH_SPECIFIC select MODULES_USE_ELF_RELA select HAVE_DEBUG_STACKOVERFLOW + select HAVE_NMI config GENERIC_CSUM def_bool y diff --git a/arch/cris/Kconfig b/arch/cris/Kconfig index 5c0ca8a..deba266 100644 --- a/arch/cris/Kconfig +++ b/arch/cris/Kconfig @@ -70,6 +70,7 @@ config CRIS select GENERIC_CLOCKEVENTS if ETRAX_ARCH_V32 select GENERIC_SCHED_CLOCK if ETRAX_ARCH_V32 select HAVE_DEBUG_BUGVERBOSE if ETRAX_ARCH_V32 + select HAVE_NMI config HZ int diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 5663f41..8040fb1 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -48,6 +48,7 @@ config MIPS select GENERIC_SCHED_CLOCK if !CAVIUM_OCTEON_SOC select GENERIC_CMOS_UPDATE select HAVE_MOD_ARCH_SPECIFIC + select HAVE_NMI select VIRT_TO_BUS select MODULES_USE_ELF_REL if MODULES select MODULES_USE_ELF_RELA if MODULES && 64BIT diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index f0403b5..01f7464 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -155,6 +155,7 @@ config PPC select NO_BOOTMEM select HAVE_GENERIC_RCU_GUP select HAVE_PERF_EVENTS_NMI if PPC64 + select HAVE_NMI if PERF_EVENTS select EDAC_SUPPORT select EDAC_ATOMIC_SCRUB select ARCH_HAS_DMA_SET_COHERENT_MASK diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index e2c9aaa..1c3c43d 100644 --- 
a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -166,6 +166,7 @@ config S390 select TTY select VIRT_CPU_ACCOUNTING select VIRT_TO_BUS + select HAVE_NMI config SCHED_OMIT_FRAME_POINTER diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index cb93af8..f625434 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -44,6 +44,7 @@ config SUPERH select OLD_SIGSUSPEND select OLD_SIGACTION select HAVE_ARCH_AUDITSYSCALL + select HAVE_NMI help The SuperH is a RISC processor targeted for use in embedded systems and consumer electronics; it was also used in the Sega Dreamcast diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 27b3a0a..1012f7f 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -79,6 +79,7 @@ config SPARC64 select NO_BOOTMEM select HAVE_ARCH_AUDITSYSCALL select ARCH_SUPPORTS_ATOMIC_RMW + select HAVE_NMI config ARCH_DEFCONFIG string diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index 1747462..76989b87 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -30,6 +30,7 @@ config TILE select HAVE_DEBUG_STACKOVERFLOW select ARCH_WANT_FRAME_POINTERS select HAVE_CONTEXT_TRACKING + select HAVE_NMI if USE_PMC select EDAC_SUPPORT select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8ff5b3b..0a7b885 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -131,6 +131,7 @@ config X86 select HAVE_MEMBLOCK select HAVE_MEMBLOCK_NODE_MAP select HAVE_MIXED_BREAKPOINTS_REGS + select HAVE_NMI select HAVE_OPROFILE select HAVE_OPTPROBES select HAVE_PCSPKR_PLATFORM diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c index 045e424..7788ce6 100644 --- a/arch/x86/kernel/apic/hw_nmi.c +++ b/arch/x86/kernel/apic/hw_nmi.c @@ -18,7 +18,6 @@ #include #include #include -#include #ifdef CONFIG_HARDLOCKUP_DETECTOR u64 hw_nmi_get_sample_period(int watchdog_thresh) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index dfd59d6..c683996 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -61,6 +61,7 @@ extern void irq_exit(void); #define nmi_enter() \ do { \ + printk_nmi_enter(); \ lockdep_off(); \ ftrace_nmi_enter(); \ BUG_ON(in_nmi()); \ @@ -77,6 +78,7 @@ extern void irq_exit(void); preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \ ftrace_nmi_exit(); \ lockdep_on(); \ + printk_nmi_exit(); \ } while (0) #endif /* LINUX_HARDIRQ_H */ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 4bc6daf..56939d3 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -129,7 +129,4 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr); (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \ __alignof__(type)) -/* To avoid include hell, as printk can not declare this, we declare it here */ -DECLARE_PER_CPU(printk_func_t, printk_func); - #endif /* __LINUX_PERCPU_H */ diff --git a/include/linux/printk.h b/include/linux/printk.h index 9ccbdf2..51dd6b8 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -122,7 +122,17 @@ static inline __printf(1, 2) __cold void early_printk(const char *s, ...) 
{ } #endif -typedef __printf(1, 0) int (*printk_func_t)(const char *fmt, va_list args); +#ifdef CONFIG_PRINTK_NMI +extern void printk_nmi_init(void); +extern void printk_nmi_enter(void); +extern void printk_nmi_exit(void); +extern void printk_nmi_flush(void); +#else +static inline void printk_nmi_init(void) { } +static inline void printk_nmi_enter(void) { } +static inline void printk_nmi_exit(void) { } +static inline void printk_nmi_flush(void) { } +#endif /* PRINTK_NMI */ #ifdef CONFIG_PRINTK asmlinkage __printf(5, 0) diff --git a/init/Kconfig b/init/Kconfig index 79a91a2..bccc1d6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1454,6 +1454,11 @@ config PRINTK very difficult to diagnose system problems, saying N here is strongly discouraged. +config PRINTK_NMI + def_bool y + depends on PRINTK + depends on HAVE_NMI + config BUG bool "BUG() support" if EXPERT default y diff --git a/init/main.c b/init/main.c index 2075faf..fa9b2bd 100644 --- a/init/main.c +++ b/init/main.c @@ -569,6 +569,7 @@ asmlinkage __visible void __init start_kernel(void) timekeeping_init(); time_init(); sched_clock_postinit(); + printk_nmi_init(); perf_event_init(); profile_init(); call_function_init(); diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile index 85405bd..abb0042 100644 --- a/kernel/printk/Makefile +++ b/kernel/printk/Makefile @@ -1,2 +1,3 @@ obj-y = printk.o +obj-$(CONFIG_PRINTK_NMI) += nmi.o obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h new file mode 100644 index 0000000..2de99fa --- /dev/null +++ b/kernel/printk/internal.h @@ -0,0 +1,44 @@ +/* + * internal.h - printk internal definitions + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . + */ +#include + +typedef __printf(1, 0) int (*printk_func_t)(const char *fmt, va_list args); + +int __printf(1, 0) vprintk_default(const char *fmt, va_list args); + +#ifdef CONFIG_PRINTK_NMI + +/* + * printk() could not take logbuf_lock in NMI context. Instead, + * it temporary stores the strings into a per-CPU buffer. + * The alternative implementation is chosen transparently + * via per-CPU variable. + */ +DECLARE_PER_CPU(printk_func_t, printk_func); +static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args) +{ + return this_cpu_read(printk_func)(fmt, args); +} + +#else /* CONFIG_PRINTK_NMI */ + +static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args) +{ + return vprintk_default(fmt, args); +} + +#endif /* CONFIG_PRINTK_NMI */ diff --git a/kernel/printk/nmi.c b/kernel/printk/nmi.c new file mode 100644 index 0000000..303cf0d --- /dev/null +++ b/kernel/printk/nmi.c @@ -0,0 +1,219 @@ +/* + * nmi.c - Safe printk in NMI context + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see . + */ + +#include +#include +#include +#include +#include +#include + +#include "internal.h" + +/* + * printk() could not take logbuf_lock in NMI context. Instead, + * it uses an alternative implementation that temporary stores + * the strings into a per-CPU buffer. The content of the buffer + * is later flushed into the main ring buffer via IRQ work. + * + * The alternative implementation is chosen transparently + * via @printk_func per-CPU variable. + * + * The implementation allows to flush the strings also from another CPU. + * There are situations when we want to make sure that all buffers + * were handled or when IRQs are blocked. + */ +DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default; +static int printk_nmi_irq_ready; + +#define NMI_LOG_BUF_LEN (4096 - sizeof(atomic_t) - sizeof(struct irq_work)) + +struct nmi_seq_buf { + atomic_t len; /* length of written data */ + struct irq_work work; /* IRQ work that flushes the buffer */ + unsigned char buffer[NMI_LOG_BUF_LEN]; +}; +static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq); + +/* + * Safe printk() for NMI context. It uses a per-CPU buffer to + * store the message. NMIs are not nested, so there is always only + * one writer running. But the buffer might get flushed from another + * CPU, so we need to be careful. + */ +static int vprintk_nmi(const char *fmt, va_list args) +{ + struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq); + int add = 0; + size_t len; + +again: + len = atomic_read(&s->len); + + if (len >= sizeof(s->buffer)) + return 0; + + /* + * Make sure that all old data have been read before the buffer was + * reseted. This is not needed when we just append data. + */ + if (!len) + smp_rmb(); + + add = vsnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, args); + + /* + * Do it once again if the buffer has been flushed in the meantime. + * Note that atomic_cmpxchg() is an implicit memory barrier that + * makes sure that the data were written before updating s->len. + */ + if (atomic_cmpxchg(&s->len, len, len + add) != len) + goto again; + + /* Get flushed in a more safe context. */ + if (add && printk_nmi_irq_ready) { + /* Make sure that IRQ work is really initialized. */ + smp_rmb(); + irq_work_queue(&s->work); + } + + return add; +} + +/* + * printk one line from the temporary buffer from @start index until + * and including the @end index. + */ +static void print_nmi_seq_line(struct nmi_seq_buf *s, int start, int end) +{ + const char *buf = s->buffer + start; + + printk("%.*s", (end - start) + 1, buf); +} + +/* + * Flush data from the associated per_CPU buffer. The function + * can be called either via IRQ work or independently. + */ +static void __printk_nmi_flush(struct irq_work *work) +{ + static raw_spinlock_t read_lock = + __RAW_SPIN_LOCK_INITIALIZER(read_lock); + struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work); + unsigned long flags; + size_t len, size; + int i, last_i; + + /* + * The lock has two functions. First, one reader has to flush all + * available message to make the lockless synchronization with + * writers easier. Second, we do not want to mix messages from + * different CPUs. 
This is especially important when printing + * a backtrace. + */ + raw_spin_lock_irqsave(&read_lock, flags); + + i = 0; +more: + len = atomic_read(&s->len); + + /* + * This is just a paranoid check that nobody has manipulated + * the buffer an unexpected way. If we printed something then + * @len must only increase. + */ + if (i && i >= len) + pr_err("printk_nmi_flush: internal error: i=%d >= len=%zu\n", + i, len); + + if (!len) + goto out; /* Someone else has already flushed the buffer. */ + + /* Make sure that data has been written up to the @len */ + smp_rmb(); + + size = min(len, sizeof(s->buffer)); + last_i = i; + + /* Print line by line. */ + for (; i < size; i++) { + if (s->buffer[i] == '\n') { + print_nmi_seq_line(s, last_i, i); + last_i = i + 1; + } + } + /* Check if there was a partial line. */ + if (last_i < size) { + print_nmi_seq_line(s, last_i, size - 1); + pr_cont("\n"); + } + + /* + * Check that nothing has got added in the meantime and truncate + * the buffer. Note that atomic_cmpxchg() is an implicit memory + * barrier that makes sure that the data were copied before + * updating s->len. + */ + if (atomic_cmpxchg(&s->len, len, 0) != len) + goto more; + +out: + raw_spin_unlock_irqrestore(&read_lock, flags); +} + +/** + * printk_nmi_flush - flush all per-cpu nmi buffers. + * + * The buffers are flushed automatically via IRQ work. This function + * is useful only when someone wants to be sure that all buffers have + * been flushed at some point. + */ +void printk_nmi_flush(void) +{ + int cpu; + + for_each_possible_cpu(cpu) + __printk_nmi_flush(&per_cpu(nmi_print_seq, cpu).work); +} + +void __init printk_nmi_init(void) +{ + int cpu; + + for_each_possible_cpu(cpu) { + struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu); + + init_irq_work(&s->work, __printk_nmi_flush); + } + + /* Make sure that IRQ works are initialized before enabling. */ + smp_wmb(); + printk_nmi_irq_ready = 1; + + /* Flush pending messages that did not have scheduled IRQ works. */ + printk_nmi_flush(); +} + +void printk_nmi_enter(void) +{ + this_cpu_write(printk_func, vprintk_nmi); +} + +void printk_nmi_exit(void) +{ + this_cpu_write(printk_func, vprintk_default); +} diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index bfbf284..71eba06 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -55,6 +55,7 @@ #include "console_cmdline.h" #include "braille.h" +#include "internal.h" int console_printk[4] = { CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */ @@ -1807,14 +1808,6 @@ int vprintk_default(const char *fmt, va_list args) } EXPORT_SYMBOL_GPL(vprintk_default); -/* - * This allows printk to be diverted to another function per cpu. - * This is useful for calling printk functions from within NMI - * without worrying about race conditions that can lock up the - * box. - */ -DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default; - /** * printk - print a kernel message * @fmt: format string @@ -1838,21 +1831,11 @@ DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default; */ asmlinkage __visible int printk(const char *fmt, ...) { - printk_func_t vprintk_func; va_list args; int r; va_start(args, fmt); - - /* - * If a caller overrides the per_cpu printk_func, then it needs - * to disable preemption when calling printk(). Otherwise - * the printk_func should be set to the default. No need to - * disable preemption here. 
- */ - vprintk_func = this_cpu_read(printk_func); r = vprintk_func(fmt, args); - va_end(args); return r; diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c index 6019c53..26caf51 100644 --- a/lib/nmi_backtrace.c +++ b/lib/nmi_backtrace.c @@ -16,33 +16,14 @@ #include #include #include -#include #ifdef arch_trigger_all_cpu_backtrace /* For reliability, we're prepared to waste bits here. */ static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly; -static cpumask_t printtrace_mask; - -#define NMI_BUF_SIZE 4096 - -struct nmi_seq_buf { - unsigned char buffer[NMI_BUF_SIZE]; - struct seq_buf seq; -}; - -/* Safe printing in NMI context */ -static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq); /* "in progress" flag of arch_trigger_all_cpu_backtrace */ static unsigned long backtrace_flag; -static void print_seq_line(struct nmi_seq_buf *s, int start, int end) -{ - const char *buf = s->buffer + start; - - printk("%.*s", (end - start) + 1, buf); -} - /* * When raise() is called it will be is passed a pointer to the * backtrace_mask. Architectures that call nmi_cpu_backtrace() @@ -52,8 +33,7 @@ static void print_seq_line(struct nmi_seq_buf *s, int start, int end) void nmi_trigger_all_cpu_backtrace(bool include_self, void (*raise)(cpumask_t *mask)) { - struct nmi_seq_buf *s; - int i, cpu, this_cpu = get_cpu(); + int i, this_cpu = get_cpu(); if (test_and_set_bit(0, &backtrace_flag)) { /* @@ -68,17 +48,6 @@ void nmi_trigger_all_cpu_backtrace(bool include_self, if (!include_self) cpumask_clear_cpu(this_cpu, to_cpumask(backtrace_mask)); - cpumask_copy(&printtrace_mask, to_cpumask(backtrace_mask)); - - /* - * Set up per_cpu seq_buf buffers that the NMIs running on the other - * CPUs will write to. - */ - for_each_cpu(cpu, to_cpumask(backtrace_mask)) { - s = &per_cpu(nmi_print_seq, cpu); - seq_buf_init(&s->seq, s->buffer, NMI_BUF_SIZE); - } - if (!cpumask_empty(to_cpumask(backtrace_mask))) { pr_info("Sending NMI to %s CPUs:\n", (include_self ? "all" : "other")); @@ -94,73 +63,25 @@ void nmi_trigger_all_cpu_backtrace(bool include_self, } /* - * Now that all the NMIs have triggered, we can dump out their - * back traces safely to the console. + * Force flush any remote buffers that might be stuck in IRQ context + * and therefore could not run their irq_work. */ - for_each_cpu(cpu, &printtrace_mask) { - int len, last_i = 0; + printk_nmi_flush(); - s = &per_cpu(nmi_print_seq, cpu); - len = seq_buf_used(&s->seq); - if (!len) - continue; - - /* Print line by line. */ - for (i = 0; i < len; i++) { - if (s->buffer[i] == '\n') { - print_seq_line(s, last_i, i); - last_i = i + 1; - } - } - /* Check if there was a partial line. */ - if (last_i < len) { - print_seq_line(s, last_i, len - 1); - pr_cont("\n"); - } - } - - clear_bit(0, &backtrace_flag); - smp_mb__after_atomic(); + clear_bit_unlock(0, &backtrace_flag); put_cpu(); } -/* - * It is not safe to call printk() directly from NMI handlers. - * It may be fine if the NMI detected a lock up and we have no choice - * but to do so, but doing a NMI on all other CPUs to get a back trace - * can be done with a sysrq-l. We don't want that to lock up, which - * can happen if the NMI interrupts a printk in progress. - * - * Instead, we redirect the vprintk() to this nmi_vprintk() that writes - * the content into a per cpu seq_buf buffer. Then when the NMIs are - * all done, we can safely dump the contents of the seq_buf to a printk() - * from a non NMI context. 
- */ -static int nmi_vprintk(const char *fmt, va_list args) -{ - struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq); - unsigned int len = seq_buf_used(&s->seq); - - seq_buf_vprintf(&s->seq, fmt, args); - return seq_buf_used(&s->seq) - len; -} - bool nmi_cpu_backtrace(struct pt_regs *regs) { int cpu = smp_processor_id(); if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) { - printk_func_t printk_func_save = this_cpu_read(printk_func); - - /* Replace printk to write into the NMI seq */ - this_cpu_write(printk_func, nmi_vprintk); pr_warn("NMI backtrace for cpu %d\n", cpu); if (regs) show_regs(regs); else dump_stack(); - this_cpu_write(printk_func, printk_func_save); - cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask)); return true; } -- cgit v0.10.2 From b522deabc6f18e4f938d93a84f345f2cbf3347d1 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 20 May 2016 17:00:36 -0700 Subject: printk/nmi: warn when some message has been lost in NMI context We could not resize the temporary buffer in NMI context. Let's warn if a message is lost. This is rather theoretical. printk() should not be used in NMI. The only sensible use is when we want to print backtrace from all CPUs. The current buffer should be enough for this purpose. [akpm@linux-foundation.org: whitespace fixlet] Signed-off-by: Petr Mladek Cc: Jan Kara Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Russell King Cc: Daniel Thompson Cc: Jiri Kosina Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Ralf Baechle Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: David Miller Cc: Daniel Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index 2de99fa..341bedc 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -34,6 +34,12 @@ static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args) return this_cpu_read(printk_func)(fmt, args); } +extern atomic_t nmi_message_lost; +static inline int get_nmi_message_lost(void) +{ + return atomic_xchg(&nmi_message_lost, 0); +} + #else /* CONFIG_PRINTK_NMI */ static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args) @@ -41,4 +47,9 @@ static inline __printf(1, 0) int vprintk_func(const char *fmt, va_list args) return vprintk_default(fmt, args); } +static inline int get_nmi_message_lost(void) +{ + return 0; +} + #endif /* CONFIG_PRINTK_NMI */ diff --git a/kernel/printk/nmi.c b/kernel/printk/nmi.c index 303cf0d..572f949 100644 --- a/kernel/printk/nmi.c +++ b/kernel/printk/nmi.c @@ -39,6 +39,7 @@ */ DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default; static int printk_nmi_irq_ready; +atomic_t nmi_message_lost; #define NMI_LOG_BUF_LEN (4096 - sizeof(atomic_t) - sizeof(struct irq_work)) @@ -64,8 +65,10 @@ static int vprintk_nmi(const char *fmt, va_list args) again: len = atomic_read(&s->len); - if (len >= sizeof(s->buffer)) + if (len >= sizeof(s->buffer)) { + atomic_inc(&nmi_message_lost); return 0; + } /* * Make sure that all old data have been read before the buffer was diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 71eba06..e38579d 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1617,6 +1617,7 @@ asmlinkage int vprintk_emit(int facility, int level, unsigned long flags; int this_cpu; int printed_len = 0; + int nmi_message_lost; bool in_sched = false; /* cpu currently holding logbuf_lock in this function */ static unsigned int logbuf_cpu = UINT_MAX; @@ -1667,6 +1668,15 @@ asmlinkage int vprintk_emit(int facility, int level, 
strlen(recursion_msg)); } + nmi_message_lost = get_nmi_message_lost(); + if (unlikely(nmi_message_lost)) { + text_len = scnprintf(textbuf, sizeof(textbuf), + "BAD LUCK: lost %d message(s) from NMI context!", + nmi_message_lost); + printed_len += log_store(0, 2, LOG_PREFIX|LOG_NEWLINE, 0, + NULL, 0, textbuf, text_len); + } + /* * The printf needs to come first; we need the syslog * prefix which might be passed-in as a parameter. -- cgit v0.10.2 From 427934b8714ec130b068d1c9d88f25b24aaede32 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 20 May 2016 17:00:39 -0700 Subject: printk/nmi: increase the size of NMI buffer and make it configurable Testing has shown that the backtrace sometimes does not fit into the 4kB temporary buffer that is used in NMI context. The warnings are gone when I double the temporary buffer size. This patch doubles the buffer size and makes it configurable. Note that this problem existed even in the x86-specific implementation that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs"). Nobody noticed it because it did not print any warnings. Signed-off-by: Petr Mladek Cc: Jan Kara Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Russell King Cc: Daniel Thompson Cc: Jiri Kosina Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Ralf Baechle Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: David Miller Cc: Daniel Thompson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/init/Kconfig b/init/Kconfig index bccc1d6..a9c4aefd 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -862,6 +862,28 @@ config LOG_CPU_MAX_BUF_SHIFT 13 => 8 KB for each CPU 12 => 4 KB for each CPU +config NMI_LOG_BUF_SHIFT + int "Temporary per-CPU NMI log buffer size (12 => 4KB, 13 => 8KB)" + range 10 21 + default 13 + depends on PRINTK_NMI + help + Select the size of a per-CPU buffer where NMI messages are temporary + stored. They are copied to the main log buffer in a safe context + to avoid a deadlock. The value defines the size as a power of 2. + + NMI messages are rare and limited. The largest one is when + a backtrace is printed. It usually fits into 4KB. Select + 8KB if you want to be on the safe side. + + Examples: + 17 => 128 KB for each CPU + 16 => 64 KB for each CPU + 15 => 32 KB for each CPU + 14 => 16 KB for each CPU + 13 => 8 KB for each CPU + 12 => 4 KB for each CPU + # # Architectures with an unreliable sched_clock() should select this: # diff --git a/kernel/printk/nmi.c b/kernel/printk/nmi.c index 572f949..bf08557 100644 --- a/kernel/printk/nmi.c +++ b/kernel/printk/nmi.c @@ -41,7 +41,8 @@ DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default; static int printk_nmi_irq_ready; atomic_t nmi_message_lost; -#define NMI_LOG_BUF_LEN (4096 - sizeof(atomic_t) - sizeof(struct irq_work)) +#define NMI_LOG_BUF_LEN ((1 << CONFIG_NMI_LOG_BUF_SHIFT) - \ + sizeof(atomic_t) - sizeof(struct irq_work)) struct nmi_seq_buf { atomic_t len; /* length of written data */ -- cgit v0.10.2 From cf9b1106c81c45cde02208fca49d3f3e4ab6ee74 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Fri, 20 May 2016 17:00:42 -0700 Subject: printk/nmi: flush NMI messages on the system panic In NMI context, printk() messages are stored into per-CPU buffers to avoid a possible deadlock. They are normally flushed to the main ring buffer via an IRQ work. But the work is never called when the system calls panic() in the very same NMI handler. This patch tries to flush NMI buffers before the crash dump is generated. 
In this case it does not risk a double release and bails out when the logbuf_lock is already taken. The aim is to get the messages into the main ring buffer when possible. It makes them better accessible in the vmcore. Then the patch tries to flush the buffers second time when other CPUs are down. It might be more aggressive and reset logbuf_lock. The aim is to get the messages available for the consequent kmsg_dump() and console_flush_on_panic() calls. The patch causes vprintk_emit() to be called even in NMI context again. But it is done via printk_deferred() so that the console handling is skipped. Consoles use internal locks and we could not prevent a deadlock easily. They are explicitly called later when the crash dump is not generated, see console_flush_on_panic(). Signed-off-by: Petr Mladek Cc: Benjamin Herrenschmidt Cc: Daniel Thompson Cc: David Miller Cc: Ingo Molnar Cc: Jan Kara Cc: Jiri Kosina Cc: Martin Schwidefsky Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Russell King Cc: Steven Rostedt Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/printk.h b/include/linux/printk.h index 51dd6b8..f4da695 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -127,11 +127,13 @@ extern void printk_nmi_init(void); extern void printk_nmi_enter(void); extern void printk_nmi_exit(void); extern void printk_nmi_flush(void); +extern void printk_nmi_flush_on_panic(void); #else static inline void printk_nmi_init(void) { } static inline void printk_nmi_enter(void) { } static inline void printk_nmi_exit(void) { } static inline void printk_nmi_flush(void) { } +static inline void printk_nmi_flush_on_panic(void) { } #endif /* PRINTK_NMI */ #ifdef CONFIG_PRINTK diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 1c03dfb..d5d4082 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -893,6 +893,7 @@ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ + printk_nmi_flush_on_panic(); __crash_kexec(regs); /* diff --git a/kernel/panic.c b/kernel/panic.c index 535c965..8aa7449 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -160,8 +160,10 @@ void panic(const char *fmt, ...) * * Bypass the panic_cpu check and call __crash_kexec directly. */ - if (!crash_kexec_post_notifiers) + if (!crash_kexec_post_notifiers) { + printk_nmi_flush_on_panic(); __crash_kexec(NULL); + } /* * Note smp_send_stop is the usual smp shutdown function, which @@ -176,6 +178,8 @@ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); + /* Call flush even twice. It tries harder with a single online CPU */ + printk_nmi_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index 341bedc..7fd2838 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -22,6 +22,8 @@ int __printf(1, 0) vprintk_default(const char *fmt, va_list args); #ifdef CONFIG_PRINTK_NMI +extern raw_spinlock_t logbuf_lock; + /* * printk() could not take logbuf_lock in NMI context. Instead, * it temporary stores the strings into a per-CPU buffer. 
diff --git a/kernel/printk/nmi.c b/kernel/printk/nmi.c index bf08557..b69eb8a 100644 --- a/kernel/printk/nmi.c +++ b/kernel/printk/nmi.c @@ -17,6 +17,7 @@ #include #include +#include #include #include #include @@ -106,7 +107,16 @@ static void print_nmi_seq_line(struct nmi_seq_buf *s, int start, int end) { const char *buf = s->buffer + start; - printk("%.*s", (end - start) + 1, buf); + /* + * The buffers are flushed in NMI only on panic. The messages must + * go only into the ring buffer at this stage. Consoles will get + * explicitly called later when a crashdump is not generated. + */ + if (in_nmi()) + printk_deferred("%.*s", (end - start) + 1, buf); + else + printk("%.*s", (end - start) + 1, buf); + } /* @@ -194,6 +204,33 @@ void printk_nmi_flush(void) __printk_nmi_flush(&per_cpu(nmi_print_seq, cpu).work); } +/** + * printk_nmi_flush_on_panic - flush all per-cpu nmi buffers when the system + * goes down. + * + * Similar to printk_nmi_flush() but it can be called even in NMI context when + * the system goes down. It does the best effort to get NMI messages into + * the main ring buffer. + * + * Note that it could try harder when there is only one CPU online. + */ +void printk_nmi_flush_on_panic(void) +{ + /* + * Make sure that we could access the main ring buffer. + * Do not risk a double release when more CPUs are up. + */ + if (in_nmi() && raw_spin_is_locked(&logbuf_lock)) { + if (num_online_cpus() > 1) + return; + + debug_locks_off(); + raw_spin_lock_init(&logbuf_lock); + } + + printk_nmi_flush(); +} + void __init printk_nmi_init(void) { int cpu; diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index e38579d..60cdf63 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -245,7 +245,7 @@ __packed __aligned(4) * within the scheduler's rq lock. It must be released before calling * console_unlock() or anything else that might wake up a process. */ -static DEFINE_RAW_SPINLOCK(logbuf_lock); +DEFINE_RAW_SPINLOCK(logbuf_lock); #ifdef CONFIG_PRINTK DECLARE_WAIT_QUEUE_HEAD(log_wait); -- cgit v0.10.2 From 6c41a3b7b2f8d10eefe5be9109fd4258dd6cf495 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:46 -0700 Subject: MAINTAINERS: remove linux@lists.openrisc.net $ host -t mx lists.openrisc.net Host lists.openrisc.net not found: 3(NXDOMAIN) Signed-off-by: Jiri Slaby Cc: Jonas Bonn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/MAINTAINERS b/MAINTAINERS index 8b92445..e65044d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8258,7 +8258,6 @@ F: drivers/of/resolver.c OPENRISC ARCHITECTURE M: Jonas Bonn W: http://openrisc.net -L: linux@lists.openrisc.net (moderated for non-subscribers) S: Maintained T: git git://openrisc.net/~jonas/linux F: arch/openrisc/ -- cgit v0.10.2 From 2e11246351012a0247307f8a0545e21dd841ba4f Mon Sep 17 00:00:00 2001 From: Eric Engestrom Date: Fri, 20 May 2016 17:00:49 -0700 Subject: MAINTAINERS: remove defunct spear mailing list It looks like the email address for this mailing list doesn't exist anymore: : host mxb-00178001.gslb.pphosted.com[91.207.212.93] said: 550 5.1.1 User Unknown (in reply to RCPT TO command) Signed-off-by: Eric Engestrom Acked-by: Viresh Kumar Cc: "David S. 
Miller" Cc: Greg Kroah-Hartman Cc: Kalle Valo Cc: Mauro Carvalho Chehab Cc: Guenter Roeck Cc: Jiri Slaby Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/MAINTAINERS b/MAINTAINERS index e65044d..ea79fe6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8812,7 +8812,6 @@ F: drivers/pinctrl/pinctrl-single.c PIN CONTROLLER - ST SPEAR M: Viresh Kumar -L: spear-devel@list.st.com L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) W: http://www.st.com/spear S: Maintained @@ -10016,7 +10015,6 @@ F: drivers/mmc/host/sdhci-s3c* SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) ST SPEAR DRIVER M: Viresh Kumar -L: spear-devel@list.st.com L: linux-mmc@vger.kernel.org S: Maintained F: drivers/mmc/host/sdhci-spear.c @@ -10579,7 +10577,6 @@ F: include/linux/compiler.h SPEAR PLATFORM SUPPORT M: Viresh Kumar M: Shiraz Hashim -L: spear-devel@list.st.com L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) W: http://www.st.com/spear S: Maintained @@ -10588,7 +10585,6 @@ F: arch/arm/mach-spear/ SPEAR CLOCK FRAMEWORK SUPPORT M: Viresh Kumar -L: spear-devel@list.st.com L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) W: http://www.st.com/spear S: Maintained -- cgit v0.10.2 From f4970682d73bba79cba10149d62482338d065086 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 20 May 2016 17:00:51 -0700 Subject: MAINTAINERS: remove Koichi Yasutake The MTA says: : unknown user: "yasutake.koichi@jp.panasonic.com" Link: http://lkml.kernel.org/r/1462776755-9607-1-git-send-email-jslaby@suse.cz Signed-off-by: Jiri Slaby Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/MAINTAINERS b/MAINTAINERS index ea79fe6..edae671 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8378,7 +8378,6 @@ F: drivers/platform/x86/panasonic-laptop.c PANASONIC MN10300/AM33/AM34 PORT M: David Howells -M: Koichi Yasutake L: linux-am33-list@redhat.com (moderated for non-subscribers) W: ftp://ftp.redhat.com/pub/redhat/gnupro/AM33/ S: Maintained -- cgit v0.10.2 From aa4ea1c3b3948d325a6826adf9c367d11fa1ab74 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:00:54 -0700 Subject: lib/vsprintf: simplify UUID printing There are few functions here and there along with type definitions that provide UUID API. This series consolidates everything under one hood and converts current users. This has been tested for a while internally, however it doesn't mean we covered all possible cases (especially accuracy of UUID constants after conversion). So, please test this as much as you can and provide your tag. We appreciate the effort. The ACPI conversion is postponed for now to sort more generic things out first. This patch (of 9): Since we have hex_byte_pack_upper() we may use it directly and avoid second loop. 
Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/vsprintf.c b/lib/vsprintf.c index ccb664b..be0e7cf 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -1324,7 +1324,10 @@ char *uuid_string(char *buf, char *end, const u8 *addr, } for (i = 0; i < 16; i++) { - p = hex_byte_pack(p, addr[index[i]]); + if (uc) + p = hex_byte_pack_upper(p, addr[index[i]]); + else + p = hex_byte_pack(p, addr[index[i]]); switch (i) { case 3: case 5: @@ -1337,13 +1340,6 @@ char *uuid_string(char *buf, char *end, const u8 *addr, *p = 0; - if (uc) { - p = uuid; - do { - *p = toupper(*p); - } while (*(++p)); - } - return string(buf, end, uuid, spec); } -- cgit v0.10.2 From b8b572789cde5118b2cee49d426d48fcf5b30e47 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:00:57 -0700 Subject: security/integrity/ima/ima_policy.c: use %pU to output UUID in printable format Instead of open coded variant re-use extension that vsprintf.c provides us for ages. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c index 3cd0a58..0f887a5 100644 --- a/security/integrity/ima/ima_policy.c +++ b/security/integrity/ima/ima_policy.c @@ -972,7 +972,7 @@ static void policy_func_show(struct seq_file *m, enum ima_hooks func) int ima_policy_show(struct seq_file *m, void *v) { struct ima_rule_entry *entry = v; - int i = 0; + int i; char tbuf[64] = {0,}; rcu_read_lock(); @@ -1012,17 +1012,7 @@ int ima_policy_show(struct seq_file *m, void *v) } if (entry->flags & IMA_FSUUID) { - seq_puts(m, "fsuuid="); - for (i = 0; i < ARRAY_SIZE(entry->fsuuid); ++i) { - switch (i) { - case 4: - case 6: - case 8: - case 10: - seq_puts(m, "-"); - } - seq_printf(m, "%x", entry->fsuuid[i]); - } + seq_printf(m, "fsuuid=%pU", entry->fsuuid); seq_puts(m, " "); } -- cgit v0.10.2 From 8da4b8c48e7b43cb16d05e1dbb34ad9f73ab7efd Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:00 -0700 Subject: lib/uuid.c: move generate_random_uuid() to uuid.c Let's gather the UUID related functions under one hood. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/char/random.c b/drivers/char/random.c index b583e53..0158d3b 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -260,6 +260,7 @@ #include #include #include +#include #include #include @@ -1621,26 +1622,6 @@ SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, count, return urandom_read(NULL, buf, count, NULL); } -/*************************************************************** - * Random UUID interface - * - * Used here for a Boot ID, but can be useful for other kernel - * drivers. 
- ***************************************************************/ - -/* - * Generate random UUID - */ -void generate_random_uuid(unsigned char uuid_out[16]) -{ - get_random_bytes(uuid_out, 16); - /* Set UUID version to 4 --- truly random generation */ - uuid_out[6] = (uuid_out[6] & 0x0F) | 0x40; - /* Set the UUID variant to DCE */ - uuid_out[8] = (uuid_out[8] & 0x3F) | 0x80; -} -EXPORT_SYMBOL(generate_random_uuid); - /******************************************************************** * * Sysctl interface diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index bd0f45f..bfb80da 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -20,13 +20,13 @@ #include #include #include -#include #include #include #include #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index eae5917..7497f50 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -13,8 +13,8 @@ #include #include #include -#include #include +#include #include #include "ext4_jbd2.h" #include "ext4.h" diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index eb9d027..c6b1495 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -20,7 +20,7 @@ #include #include #include -#include +#include #include "f2fs.h" #include "node.h" diff --git a/fs/reiserfs/objectid.c b/fs/reiserfs/objectid.c index 99a5d5d..415d66c 100644 --- a/fs/reiserfs/objectid.c +++ b/fs/reiserfs/objectid.c @@ -3,8 +3,8 @@ */ #include -#include #include +#include #include "reiserfs.h" /* find where objectid map starts */ diff --git a/fs/ubifs/sb.c b/fs/ubifs/sb.c index f4fbc7b..3cbb904 100644 --- a/fs/ubifs/sb.c +++ b/fs/ubifs/sb.c @@ -28,8 +28,8 @@ #include "ubifs.h" #include -#include #include +#include /* * Default journal size in logical eraseblocks as a percent of total diff --git a/include/linux/random.h b/include/linux/random.h index 9c29122..e47e533 100644 --- a/include/linux/random.h +++ b/include/linux/random.h @@ -26,7 +26,6 @@ extern void get_random_bytes(void *buf, int nbytes); extern int add_random_ready_callback(struct random_ready_callback *rdy); extern void del_random_ready_callback(struct random_ready_callback *rdy); extern void get_random_bytes_arch(void *buf, int nbytes); -void generate_random_uuid(unsigned char uuid_out[16]); extern int random_int_secret_init(void); #ifndef MODULE diff --git a/include/linux/uuid.h b/include/linux/uuid.h index 6df2509..91c2b6d 100644 --- a/include/linux/uuid.h +++ b/include/linux/uuid.h @@ -33,6 +33,8 @@ static inline int uuid_be_cmp(const uuid_be u1, const uuid_be u2) return memcmp(&u1, &u2, sizeof(uuid_be)); } +void generate_random_uuid(unsigned char uuid[16]); + extern void uuid_le_gen(uuid_le *u); extern void uuid_be_gen(uuid_be *u); diff --git a/lib/uuid.c b/lib/uuid.c index 398821e..6c81c0b 100644 --- a/lib/uuid.c +++ b/lib/uuid.c @@ -23,6 +23,26 @@ #include #include +/*************************************************************** + * Random UUID interface + * + * Used here for a Boot ID, but can be useful for other kernel + * drivers. 
+ ***************************************************************/ + +/* + * Generate random UUID + */ +void generate_random_uuid(unsigned char uuid[16]) +{ + get_random_bytes(uuid, 16); + /* Set UUID version to 4 --- truly random generation */ + uuid[6] = (uuid[6] & 0x0F) | 0x40; + /* Set the UUID variant to DCE */ + uuid[8] = (uuid[8] & 0x3F) | 0x80; +} +EXPORT_SYMBOL(generate_random_uuid); + static void __uuid_gen_common(__u8 b[16]) { prandom_bytes(b, 16); -- cgit v0.10.2 From 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:04 -0700 Subject: lib/uuid.c: introduce a few more generic helpers There are new helpers in this patch: uuid_is_valid checks if a UUID is valid uuid_be_to_bin converts from string to binary (big endian) uuid_le_to_bin converts from string to binary (little endian) They will be used in future, i.e. in the following patches in the series. This also moves the indices arrays to lib/uuid.c to be shared accross modules. [andriy.shevchenko@linux.intel.com: fix typo] Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/uuid.h b/include/linux/uuid.h index 91c2b6d..e0b95e7 100644 --- a/include/linux/uuid.h +++ b/include/linux/uuid.h @@ -22,6 +22,11 @@ #include +/* + * The length of a UUID string ("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee") + * not including trailing NUL. + */ +#define UUID_STRING_LEN 36 static inline int uuid_le_cmp(const uuid_le u1, const uuid_le u2) { @@ -38,4 +43,12 @@ void generate_random_uuid(unsigned char uuid[16]); extern void uuid_le_gen(uuid_le *u); extern void uuid_be_gen(uuid_be *u); +bool __must_check uuid_is_valid(const char *uuid); + +extern const u8 uuid_le_index[16]; +extern const u8 uuid_be_index[16]; + +int uuid_le_to_bin(const char *uuid, uuid_le *u); +int uuid_be_to_bin(const char *uuid, uuid_be *u); + #endif diff --git a/lib/uuid.c b/lib/uuid.c index 6c81c0b..82787f6 100644 --- a/lib/uuid.c +++ b/lib/uuid.c @@ -19,10 +19,17 @@ */ #include +#include +#include #include #include #include +const u8 uuid_le_index[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15}; +EXPORT_SYMBOL(uuid_le_index); +const u8 uuid_be_index[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}; +EXPORT_SYMBOL(uuid_be_index); + /*************************************************************** * Random UUID interface * @@ -65,3 +72,61 @@ void uuid_be_gen(uuid_be *bu) bu->b[6] = (bu->b[6] & 0x0F) | 0x40; } EXPORT_SYMBOL_GPL(uuid_be_gen); + +/** + * uuid_is_valid - checks if UUID string valid + * @uuid: UUID string to check + * + * Description: + * It checks if the UUID string is following the format: + * xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + * where x is a hex digit. + * + * Return: true if input is valid UUID string. 
+ */ +bool uuid_is_valid(const char *uuid) +{ + unsigned int i; + + for (i = 0; i < UUID_STRING_LEN; i++) { + if (i == 8 || i == 13 || i == 18 || i == 23) { + if (uuid[i] != '-') + return false; + } else if (!isxdigit(uuid[i])) { + return false; + } + } + + return true; +} +EXPORT_SYMBOL(uuid_is_valid); + +static int __uuid_to_bin(const char *uuid, __u8 b[16], const u8 ei[16]) +{ + static const u8 si[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34}; + unsigned int i; + + if (!uuid_is_valid(uuid)) + return -EINVAL; + + for (i = 0; i < 16; i++) { + int hi = hex_to_bin(uuid[si[i]] + 0); + int lo = hex_to_bin(uuid[si[i]] + 1); + + b[ei[i]] = (hi << 4) | lo; + } + + return 0; +} + +int uuid_le_to_bin(const char *uuid, uuid_le *u) +{ + return __uuid_to_bin(uuid, u->b, uuid_le_index); +} +EXPORT_SYMBOL(uuid_le_to_bin); + +int uuid_be_to_bin(const char *uuid, uuid_be *u) +{ + return __uuid_to_bin(uuid, u->b, uuid_be_index); +} +EXPORT_SYMBOL(uuid_be_to_bin); diff --git a/lib/vsprintf.c b/lib/vsprintf.c index be0e7cf..0967771 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #ifdef CONFIG_BLOCK #include @@ -1304,19 +1305,17 @@ static noinline_for_stack char *uuid_string(char *buf, char *end, const u8 *addr, struct printf_spec spec, const char *fmt) { - char uuid[sizeof("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")]; + char uuid[UUID_STRING_LEN + 1]; char *p = uuid; int i; - static const u8 be[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}; - static const u8 le[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15}; - const u8 *index = be; + const u8 *index = uuid_be_index; bool uc = false; switch (*(++fmt)) { case 'L': uc = true; /* fall-through */ case 'l': - index = le; + index = uuid_le_index; break; case 'B': uc = true; -- cgit v0.10.2 From e3a93bce69ad3e2c38927abe311b8cb4f17abbaf Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:07 -0700 Subject: lib/uuid.c: remove FSF address There is no point in keeping an address in the file since it's subject to change. While here, update Intel Copyright years. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/uuid.h b/include/linux/uuid.h index e0b95e7..2d095fc 100644 --- a/include/linux/uuid.h +++ b/include/linux/uuid.h @@ -1,7 +1,7 @@ /* * UUID/GUID definition * - * Copyright (C) 2010, Intel Corp. + * Copyright (C) 2010, 2016 Intel Corp. * Huang Ying * * This program is free software; you can redistribute it and/or @@ -12,10 +12,6 @@ * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #ifndef _LINUX_UUID_H_ #define _LINUX_UUID_H_ diff --git a/include/uapi/linux/uuid.h b/include/uapi/linux/uuid.h index 786f077..3738e5f 100644 --- a/include/uapi/linux/uuid.h +++ b/include/uapi/linux/uuid.h @@ -12,10 +12,6 @@ * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
- * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #ifndef _UAPI_LINUX_UUID_H_ diff --git a/lib/uuid.c b/lib/uuid.c index 82787f6..e116ae5 100644 --- a/lib/uuid.c +++ b/lib/uuid.c @@ -1,7 +1,7 @@ /* * Unified UUID/GUID definition * - * Copyright (C) 2009, Intel Corp. + * Copyright (C) 2009, 2016 Intel Corp. * Huang Ying * * This program is free software; you can redistribute it and/or @@ -12,10 +12,6 @@ * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #include -- cgit v0.10.2 From ede9c27749b9b35efdffa4f63a39f819d7913752 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:10 -0700 Subject: kernel/sysctl_binary.c: use generic UUID library UUID library provides uuid_be type and uuid_be_to_bin() function. This substitutes open coded variant by generic library calls. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c index 10a1d7d..6eb99c1 100644 --- a/kernel/sysctl_binary.c +++ b/kernel/sysctl_binary.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -1117,9 +1118,8 @@ static ssize_t bin_uuid(struct file *file, /* Only supports reads */ if (oldval && oldlen) { - char buf[40], *str = buf; - unsigned char uuid[16]; - int i; + char buf[UUID_STRING_LEN + 1]; + uuid_be uuid; result = kernel_read(file, 0, buf, sizeof(buf) - 1); if (result < 0) @@ -1127,24 +1127,15 @@ static ssize_t bin_uuid(struct file *file, buf[result] = '\0'; - /* Convert the uuid to from a string to binary */ - for (i = 0; i < 16; i++) { - result = -EIO; - if (!isxdigit(str[0]) || !isxdigit(str[1])) - goto out; - - uuid[i] = (hex_to_bin(str[0]) << 4) | - hex_to_bin(str[1]); - str += 2; - if (*str == '-') - str++; - } + result = -EIO; + if (uuid_be_to_bin(buf, &uuid)) + goto out; if (oldlen > 16) oldlen = 16; result = -EFAULT; - if (copy_to_user(oldval, uuid, oldlen)) + if (copy_to_user(oldval, &uuid, oldlen)) goto out; copied = oldlen; -- cgit v0.10.2 From ba7e34b1bbd2722685bbc75d168672d5154d8614 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:18 -0700 Subject: include/linux/efi.h: redefine type, constant, macro from generic code Generic UUID library defines structure type, macro to define UUID, and the length of the UUID string. This patch removes duplicate data structure definition, UUID string length constant as well as macro for UUID handling. 
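To make the byte-order implication concrete, here is an illustrative sketch (not part of the patch; the GUID value is made up): efi_guid_t becomes a typedef of uuid_le, so an EFI_GUID() initializer now expands through UUID_LE() and stores its first three numeric fields least-significant byte first.

	efi_guid_t g = EFI_GUID(0xaabbccdd, 0xeeff, 0x0011,
				0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99);
	/* g.b[] == { 0xdd, 0xcc, 0xbb, 0xaa, 0xff, 0xee, 0x11, 0x00,
	 *            0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99 } */

The in-memory representation is unchanged; only the duplicated type, macro and string-length definitions go away.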
Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/efi.h b/include/linux/efi.h index df7acb5..c2db3ca 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -44,17 +45,10 @@ typedef u16 efi_char16_t; /* UNICODE character */ typedef u64 efi_physical_addr_t; typedef void *efi_handle_t; - -typedef struct { - u8 b[16]; -} efi_guid_t; +typedef uuid_le efi_guid_t; #define EFI_GUID(a,b,c,d0,d1,d2,d3,d4,d5,d6,d7) \ -((efi_guid_t) \ -{{ (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \ - (b) & 0xff, ((b) >> 8) & 0xff, \ - (c) & 0xff, ((c) >> 8) & 0xff, \ - (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }}) + UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7) /* * Generic EFI table header @@ -1117,7 +1111,7 @@ extern int efi_status_to_err(efi_status_t status); * Length of a GUID string (strlen("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")) * not including trailing NUL */ -#define EFI_VARIABLE_GUID_LEN 36 +#define EFI_VARIABLE_GUID_LEN UUID_STRING_LEN /* * The type of search to perform when calling boottime->locate_handle -- cgit v0.10.2 From 8236431d8d09eee70e6cbc506426a7c97778a2e6 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:21 -0700 Subject: fs/efivarfs/inode.c: use generic UUID library Instead of opencoding let's use generic UUID library functions here. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index e2ab6d0..1d73fc6 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "internal.h" @@ -46,11 +47,7 @@ struct inode *efivarfs_get_inode(struct super_block *sb, */ bool efivarfs_valid_name(const char *str, int len) { - static const char dashes[EFI_VARIABLE_GUID_LEN] = { - [8] = 1, [13] = 1, [18] = 1, [23] = 1 - }; const char *s = str + len - EFI_VARIABLE_GUID_LEN; - int i; /* * We need a GUID, plus at least one letter for the variable name, @@ -68,37 +65,7 @@ bool efivarfs_valid_name(const char *str, int len) * * 12345678-1234-1234-1234-123456789abc */ - for (i = 0; i < EFI_VARIABLE_GUID_LEN; i++) { - if (dashes[i]) { - if (*s++ != '-') - return false; - } else { - if (!isxdigit(*s++)) - return false; - } - } - - return true; -} - -static void efivarfs_hex_to_guid(const char *str, efi_guid_t *guid) -{ - guid->b[0] = hex_to_bin(str[6]) << 4 | hex_to_bin(str[7]); - guid->b[1] = hex_to_bin(str[4]) << 4 | hex_to_bin(str[5]); - guid->b[2] = hex_to_bin(str[2]) << 4 | hex_to_bin(str[3]); - guid->b[3] = hex_to_bin(str[0]) << 4 | hex_to_bin(str[1]); - guid->b[4] = hex_to_bin(str[11]) << 4 | hex_to_bin(str[12]); - guid->b[5] = hex_to_bin(str[9]) << 4 | hex_to_bin(str[10]); - guid->b[6] = hex_to_bin(str[16]) << 4 | hex_to_bin(str[17]); - guid->b[7] = hex_to_bin(str[14]) << 4 | hex_to_bin(str[15]); - guid->b[8] = hex_to_bin(str[19]) << 4 | hex_to_bin(str[20]); - guid->b[9] = hex_to_bin(str[21]) << 4 | hex_to_bin(str[22]); - guid->b[10] = hex_to_bin(str[24]) << 4 | hex_to_bin(str[25]); - guid->b[11] = hex_to_bin(str[26]) << 4 | hex_to_bin(str[27]); - guid->b[12] = 
hex_to_bin(str[28]) << 4 | hex_to_bin(str[29]); - guid->b[13] = hex_to_bin(str[30]) << 4 | hex_to_bin(str[31]); - guid->b[14] = hex_to_bin(str[32]) << 4 | hex_to_bin(str[33]); - guid->b[15] = hex_to_bin(str[34]) << 4 | hex_to_bin(str[35]); + return uuid_is_valid(s); } static int efivarfs_create(struct inode *dir, struct dentry *dentry, @@ -119,8 +86,7 @@ static int efivarfs_create(struct inode *dir, struct dentry *dentry, /* length of the variable name itself: remove GUID and separator */ namelen = dentry->d_name.len - EFI_VARIABLE_GUID_LEN - 1; - efivarfs_hex_to_guid(dentry->d_name.name + namelen + 1, - &var->var.VendorGuid); + uuid_le_to_bin(dentry->d_name.name + namelen + 1, &var->var.VendorGuid); if (efivar_variable_is_removable(var->var.VendorGuid, dentry->d_name.name, namelen)) -- cgit v0.10.2 From 63579785752ba7d0e842078ec6b2875367046f06 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:24 -0700 Subject: include/linux/genhd.h: move to use generic UUID library UUID library provides uuid_be type and uuid_be_to_bin() function. This substitutes open coded variant by generic library calls. Signed-off-by: Andy Shevchenko Reviewed-by: Matt Fleming Cc: Dmitry Kasatkin Cc: Mimi Zohar Cc: Rasmus Villemoes Cc: Arnd Bergmann Cc: "Theodore Ts'o" Cc: Al Viro Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 5c70676..359a8e4 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -14,6 +14,7 @@ #include #include #include +#include #ifdef CONFIG_BLOCK @@ -93,7 +94,7 @@ struct disk_stats { * Enough for the string representation of any kind of UUID plus NULL. * EFI UUID is 36 characters. MSDOS UUID is 11 characters. */ -#define PARTITION_META_INFO_UUIDLTH 37 +#define PARTITION_META_INFO_UUIDLTH (UUID_STRING_LEN + 1) struct partition_meta_info { char uuid[PARTITION_META_INFO_UUIDLTH]; @@ -228,27 +229,9 @@ static inline struct gendisk *part_to_disk(struct hd_struct *part) return NULL; } -static inline void part_pack_uuid(const u8 *uuid_str, u8 *to) -{ - int i; - for (i = 0; i < 16; ++i) { - *to++ = (hex_to_bin(*uuid_str) << 4) | - (hex_to_bin(*(uuid_str + 1))); - uuid_str += 2; - switch (i) { - case 3: - case 5: - case 7: - case 9: - uuid_str++; - continue; - } - } -} - static inline int blk_part_pack_uuid(const u8 *uuid_str, u8 *to) { - part_pack_uuid(uuid_str, to); + uuid_be_to_bin(uuid_str, (uuid_be *)to); return 0; } -- cgit v0.10.2 From 7244ad69cc1382fd4dfb18a4fea5613acd67a2e9 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:27 -0700 Subject: block/partitions/ldm.c: use generic UUID library Instead of opencoding let's use generic UUID library functions here. Signed-off-by: Andy Shevchenko Cc: "Richard Russon (FlatCap)" Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c index e507cfb..edcea70 100644 --- a/block/partitions/ldm.c +++ b/block/partitions/ldm.c @@ -27,6 +27,8 @@ #include #include #include +#include + #include "ldm.h" #include "check.h" #include "msdos.h" @@ -66,60 +68,6 @@ void _ldm_printk(const char *level, const char *function, const char *fmt, ...) } /** - * ldm_parse_hexbyte - Convert a ASCII hex number to a byte - * @src: Pointer to at least 2 characters to convert. - * - * Convert a two character ASCII hex string to a number. 
- * - * Return: 0-255 Success, the byte was parsed correctly - * -1 Error, an invalid character was supplied - */ -static int ldm_parse_hexbyte (const u8 *src) -{ - unsigned int x; /* For correct wrapping */ - int h; - - /* high part */ - x = h = hex_to_bin(src[0]); - if (h < 0) - return -1; - - /* low part */ - h = hex_to_bin(src[1]); - if (h < 0) - return -1; - - return (x << 4) + h; -} - -/** - * ldm_parse_guid - Convert GUID from ASCII to binary - * @src: 36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba - * @dest: Memory block to hold binary GUID (16 bytes) - * - * N.B. The GUID need not be NULL terminated. - * - * Return: 'true' @dest contains binary GUID - * 'false' @dest contents are undefined - */ -static bool ldm_parse_guid (const u8 *src, u8 *dest) -{ - static const int size[] = { 4, 2, 2, 2, 6 }; - int i, j, v; - - if (src[8] != '-' || src[13] != '-' || - src[18] != '-' || src[23] != '-') - return false; - - for (j = 0; j < 5; j++, src++) - for (i = 0; i < size[j]; i++, src+=2, *dest++ = v) - if ((v = ldm_parse_hexbyte (src)) < 0) - return false; - - return true; -} - -/** * ldm_parse_privhead - Read the LDM Database PRIVHEAD structure * @data: Raw database PRIVHEAD structure loaded from the device * @ph: In-memory privhead structure in which to return parsed information @@ -167,7 +115,7 @@ static bool ldm_parse_privhead(const u8 *data, struct privhead *ph) ldm_error("PRIVHEAD disk size doesn't match real disk size"); return false; } - if (!ldm_parse_guid(data + 0x0030, ph->disk_id)) { + if (uuid_be_to_bin(data + 0x0030, (uuid_be *)ph->disk_id)) { ldm_error("PRIVHEAD contains an invalid GUID."); return false; } @@ -944,7 +892,7 @@ static bool ldm_parse_dsk3 (const u8 *buffer, int buflen, struct vblk *vb) disk = &vb->vblk.disk; ldm_get_vstr (buffer + 0x18 + r_diskid, disk->alt_name, sizeof (disk->alt_name)); - if (!ldm_parse_guid (buffer + 0x19 + r_name, disk->disk_id)) + if (uuid_be_to_bin(buffer + 0x19 + r_name, (uuid_be *)disk->disk_id)) return false; return true; -- cgit v0.10.2 From 538d7eb86d58b3d7d73f3bb3ff960c4bdf411c1a Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Fri, 20 May 2016 17:01:30 -0700 Subject: drivers/platform/x86/wmi.c: use generic UUID library Instead of opencoding let's use generic UUID library functions here. Signed-off-by: Andy Shevchenko Cc: Darren Hart Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/platform/x86/wmi.c b/drivers/platform/x86/wmi.c index eb391a2..ceeb8c1 100644 --- a/drivers/platform/x86/wmi.c +++ b/drivers/platform/x86/wmi.c @@ -37,6 +37,7 @@ #include #include #include +#include ACPI_MODULE_NAME("wmi"); MODULE_AUTHOR("Carlos Corbacho"); @@ -115,100 +116,21 @@ static struct acpi_driver acpi_wmi_driver = { * GUID parsing functions */ -/** - * wmi_parse_hexbyte - Convert a ASCII hex number to a byte - * @src: Pointer to at least 2 characters to convert. - * - * Convert a two character ASCII hex string to a number. 
- * - * Return: 0-255 Success, the byte was parsed correctly - * -1 Error, an invalid character was supplied - */ -static int wmi_parse_hexbyte(const u8 *src) -{ - int h; - int value; - - /* high part */ - h = value = hex_to_bin(src[0]); - if (value < 0) - return -1; - - /* low part */ - value = hex_to_bin(src[1]); - if (value >= 0) - return (h << 4) | value; - return -1; -} - -/** - * wmi_swap_bytes - Rearrange GUID bytes to match GUID binary - * @src: Memory block holding binary GUID (16 bytes) - * @dest: Memory block to hold byte swapped binary GUID (16 bytes) - * - * Byte swap a binary GUID to match it's real GUID value - */ -static void wmi_swap_bytes(u8 *src, u8 *dest) -{ - int i; - - for (i = 0; i <= 3; i++) - memcpy(dest + i, src + (3 - i), 1); - - for (i = 0; i <= 1; i++) - memcpy(dest + 4 + i, src + (5 - i), 1); - - for (i = 0; i <= 1; i++) - memcpy(dest + 6 + i, src + (7 - i), 1); - - memcpy(dest + 8, src + 8, 8); -} - -/** - * wmi_parse_guid - Convert GUID from ASCII to binary - * @src: 36 char string of the form fa50ff2b-f2e8-45de-83fa-65417f2f49ba - * @dest: Memory block to hold binary GUID (16 bytes) - * - * N.B. The GUID need not be NULL terminated. - * - * Return: 'true' @dest contains binary GUID - * 'false' @dest contents are undefined - */ -static bool wmi_parse_guid(const u8 *src, u8 *dest) -{ - static const int size[] = { 4, 2, 2, 2, 6 }; - int i, j, v; - - if (src[8] != '-' || src[13] != '-' || - src[18] != '-' || src[23] != '-') - return false; - - for (j = 0; j < 5; j++, src++) { - for (i = 0; i < size[j]; i++, src += 2, *dest++ = v) { - v = wmi_parse_hexbyte(src); - if (v < 0) - return false; - } - } - - return true; -} - static bool find_guid(const char *guid_string, struct wmi_block **out) { - char tmp[16], guid_input[16]; + uuid_le guid_input; struct wmi_block *wblock; struct guid_block *block; struct list_head *p; - wmi_parse_guid(guid_string, tmp); - wmi_swap_bytes(tmp, guid_input); + if (uuid_le_to_bin(guid_string, &guid_input)) + return false; list_for_each(p, &wmi_block_list) { wblock = list_entry(p, struct wmi_block, list); block = &wblock->gblock; - if (memcmp(block->guid, guid_input, 16) == 0) { + if (memcmp(block->guid, &guid_input, 16) == 0) { if (out) *out = wblock; return true; @@ -498,20 +420,20 @@ wmi_notify_handler handler, void *data) { struct wmi_block *block; acpi_status status = AE_NOT_EXIST; - char tmp[16], guid_input[16]; + uuid_le guid_input; struct list_head *p; if (!guid || !handler) return AE_BAD_PARAMETER; - wmi_parse_guid(guid, tmp); - wmi_swap_bytes(tmp, guid_input); + if (uuid_le_to_bin(guid, &guid_input)) + return AE_BAD_PARAMETER; list_for_each(p, &wmi_block_list) { acpi_status wmi_status; block = list_entry(p, struct wmi_block, list); - if (memcmp(block->gblock.guid, guid_input, 16) == 0) { + if (memcmp(block->gblock.guid, &guid_input, 16) == 0) { if (block->handler && block->handler != wmi_notify_debug) return AE_ALREADY_ACQUIRED; @@ -539,20 +461,20 @@ acpi_status wmi_remove_notify_handler(const char *guid) { struct wmi_block *block; acpi_status status = AE_NOT_EXIST; - char tmp[16], guid_input[16]; + uuid_le guid_input; struct list_head *p; if (!guid) return AE_BAD_PARAMETER; - wmi_parse_guid(guid, tmp); - wmi_swap_bytes(tmp, guid_input); + if (uuid_le_to_bin(guid, &guid_input)) + return AE_BAD_PARAMETER; list_for_each(p, &wmi_block_list) { acpi_status wmi_status; block = list_entry(p, struct wmi_block, list); - if (memcmp(block->gblock.guid, guid_input, 16) == 0) { + if (memcmp(block->gblock.guid, &guid_input, 16) == 0) { if 
(!block->handler || block->handler == wmi_notify_debug) return AE_NULL_ENTRY; -- cgit v0.10.2 From e9256efcc8e390fa4fcf796a0c0b47d642d77d32 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:33 -0700 Subject: radix-tree: introduce radix_tree_empty Commit e61452365372 ("radix_tree: add support for multi-order entries") left the impression that the support for multiorder radix tree entries was functional. As soon as Ross tried to use it, it became apparent that my testing was completely inadequate, and it didn't even work a little bit for orders that were not a multiple of shift. This series of patches is the result of about 6 weeks of redesign, reimplementation, testing, arguing and hair-pulling. The great news is that the test-suite is now far better than it was. That's reflected in the diffstat for the test-suite alone: 12 files changed, 436 insertions(+), 28 deletions(-) The highlight for users of the tree is that the restriction on the order of inserted entries being >= RADIX_TREE_MAP_SHIFT is now gone; the radix tree now supports any order between 0 and 64. For those who are interested in how the tree works, patch 9 is probably the most interesting one as it introduces the new machinery for handling sibling entries. I've tried to be fair in attributing authorship to the person who contributed the majority of the code in each patch; Ross has been an invaluable partner in the development of this support and it's fair to say that each of us has code in every commit. I should also express my appreciation of the 0day testing. It prompted me that I was bloating the tinyconfig in an unacceptable way, and it bisected to a commit which contained a rather nasty memory-corruption bug. This patch (of 29): The irqdomain code was checking for 0 or 1 entries, not 0 entries like the comment said they were. Introduce a new helper that will actually check for an empty tree. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 51a97ac..83f708e 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -136,6 +136,11 @@ do { \ (root)->rnode = NULL; \ } while (0) +static inline bool radix_tree_empty(struct radix_tree_root *root) +{ + return root->rnode == NULL; +} + /** * Radix-tree synchronization * diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index d65f6f3..8798b6c 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -139,12 +139,7 @@ void irq_domain_remove(struct irq_domain *domain) { mutex_lock(&irq_domain_mutex); - /* - * radix_tree_delete() takes care of destroying the root - * node when all entries are removed. Shout if there are - * any mappings left. - */ - WARN_ON(domain->revmap_tree.height); + WARN_ON(!radix_tree_empty(&domain->revmap_tree)); list_del(&domain->link); -- cgit v0.10.2 From f518b1607e128a8dcfa75f539864c1321c5a18ea Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:36 -0700 Subject: radix tree test suite: fix build Add an empty linux/init.h, and definitions for a few parts of the kernel API either in use now, or to be used in the near future. Start using the common definitions in tools/include/linux, although more work needs to be done here. 
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index ae013b0..6d0cdf6 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -7,19 +7,25 @@ #include #include +#include "../../include/linux/compiler.h" + #ifndef NULL #define NULL 0 #endif #define BUG_ON(expr) assert(!(expr)) +#define WARN_ON(expr) assert(!(expr)) #define __init #define __must_check #define panic(expr) #define printk printf #define __force -#define likely(c) (c) -#define unlikely(c) (c) #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d)) +#define pr_debug printk + +#define smp_rmb() barrier() +#define smp_wmb() barrier() +#define cpu_relax() barrier() #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) @@ -28,6 +34,8 @@ (type *)( (char *)__mptr - offsetof(type, member) );}) #define min(a, b) ((a) < (b) ? (a) : (b)) +#define cond_resched() sched_yield() + static inline int in_interrupt(void) { return 0; diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h index 57282506..6d5a347 100644 --- a/tools/testing/radix-tree/linux/slab.h +++ b/tools/testing/radix-tree/linux/slab.h @@ -3,7 +3,6 @@ #include -#define GFP_KERNEL 1 #define SLAB_HWCACHE_ALIGN 1 #define SLAB_PANIC 2 #define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */ diff --git a/tools/testing/radix-tree/linux/types.h b/tools/testing/radix-tree/linux/types.h index 72a9d85..faa0b6f 100644 --- a/tools/testing/radix-tree/linux/types.h +++ b/tools/testing/radix-tree/linux/types.h @@ -1,15 +1,13 @@ #ifndef _TYPES_H #define _TYPES_H +#include "../../include/linux/types.h" + #define __rcu #define __read_mostly #define BITS_PER_LONG (sizeof(long) * 8) -struct list_head { - struct list_head *next, *prev; -}; - static inline void INIT_LIST_HEAD(struct list_head *list) { list->next = list; @@ -22,7 +20,6 @@ typedef struct { #define uninitialized_var(x) x = x -typedef unsigned gfp_t; #include #endif -- cgit v0.10.2 From d42cb1a9fffa9dc760c13302f00cdec25106e2f1 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:39 -0700 Subject: radix tree test suite: add tests for radix_tree_locate_item() Fairly simple tests; add various items to the tree, then make sure we can find them again. Also check that a pointer that we know isn't in the tree is not found. 
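The contract being tested can be sketched as follows (illustrative only; item_insert() and item_lookup() are the test-suite helpers used in the diff below): radix_tree_locate_item() returns the index at which a given item pointer is stored, and -1 when the pointer is not present in the tree.

	RADIX_TREE(tree, GFP_KERNEL);

	item_insert(&tree, 42);
	radix_tree_locate_item(&tree, item_lookup(&tree, 42));	/* returns 42 */
	radix_tree_locate_item(&tree, &tree);			/* absent: returns -1 */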
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index 6d0cdf6..76a88f3 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -9,6 +9,9 @@ #include "../../include/linux/compiler.h" +#define CONFIG_SHMEM +#define CONFIG_SWAP + #ifndef NULL #define NULL 0 #endif diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index 0e83cad..71c5272 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -232,10 +232,51 @@ void copy_tag_check(void) item_kill_tree(&tree); } +void __locate_check(struct radix_tree_root *tree, unsigned long index) +{ + struct item *item; + unsigned long index2; + + item_insert(tree, index); + item = item_lookup(tree, index); + index2 = radix_tree_locate_item(tree, item); + if (index != index2) { + printf("index %ld inserted; found %ld\n", + index, index2); + abort(); + } +} + +static void locate_check(void) +{ + RADIX_TREE(tree, GFP_KERNEL); + unsigned long offset, index; + + for (offset = 0; offset < (1 << 3); offset++) { + for (index = 0; index < (1UL << 5); index++) { + __locate_check(&tree, index + offset); + } + if (radix_tree_locate_item(&tree, &tree) != -1) + abort(); + + item_kill_tree(&tree); + } + + if (radix_tree_locate_item(&tree, &tree) != -1) + abort(); + __locate_check(&tree, -1); + if (radix_tree_locate_item(&tree, &tree) != -1) + abort(); + item_kill_tree(&tree); +} + static void single_thread_tests(void) { int i; + printf("starting single_thread_tests: %d allocated\n", nr_allocated); + locate_check(); + printf("after locate_check: %d allocated\n", nr_allocated); tag_check(); printf("after tag_check: %d allocated\n", nr_allocated); gang_check(); -- cgit v0.10.2 From 97d778b2de9213c7a7483dad0f533c1af9f0810f Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:01:42 -0700 Subject: radix tree test suite: allow testing other fan-out values The defines in regression2.c are already in radix-tree.h and duplicating them in the test case makes experimenting with other values for the fan-out harder than necessary. Allow the user of the radix tree to decide what the fan-out should be rather than fixing it to 8 for non-kernel uses. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 83f708e..5ce5a1e 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -70,10 +70,8 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) #define RADIX_TREE_MAX_TAGS 3 -#ifdef __KERNEL__ +#ifndef RADIX_TREE_MAP_SHIFT #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 
4 : 6) -#else -#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */ #endif #define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT) diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index 76a88f3..31fe2c77 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -12,6 +12,8 @@ #define CONFIG_SHMEM #define CONFIG_SWAP +#define RADIX_TREE_MAP_SHIFT 3 + #ifndef NULL #define NULL 0 #endif diff --git a/tools/testing/radix-tree/regression2.c b/tools/testing/radix-tree/regression2.c index 5d2fa28..63bf347 100644 --- a/tools/testing/radix-tree/regression2.c +++ b/tools/testing/radix-tree/regression2.c @@ -51,13 +51,6 @@ #include "regression.h" -#ifdef __KERNEL__ -#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) -#else -#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */ -#endif - -#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT) #define PAGECACHE_TAG_DIRTY 0 #define PAGECACHE_TAG_WRITEBACK 1 #define PAGECACHE_TAG_TOWRITE 2 -- cgit v0.10.2 From aa1d62d8530d5adf158dd633d360108466f93fcd Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:01:45 -0700 Subject: radix tree test suite: keep regression test runs short Currently the full suite of regression tests take upwards of 30 minutes to run on my development machine. The vast majority of this time is taken by the big_gang_check() and copy_tag_check() tests, which each run their tests through thousands of iterations...does this have value? Without big_gang_check() and copy_tag_check(), the test suite runs in around 15 seconds on my box. Honestly the first time I ever ran through the entire test suite was to gather the timings for this email - it simply takes too long to be useful on a normal basis. Instead, hide the excessive iterations through big_gang_check() and copy_tag_check() tests behind an '-l' flag (for "long run") in case they are still useful, but allow the regression test suite to complete in a reasonable amount of time. We still run each of these tests a few times (3 at present) to try and keep the test coverage. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index 71c5272..122c8b9 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -61,11 +61,11 @@ void __big_gang_check(void) } while (!wrapped); } -void big_gang_check(void) +void big_gang_check(bool long_run) { int i; - for (i = 0; i < 1000; i++) { + for (i = 0; i < (long_run ? 1000 : 3); i++) { __big_gang_check(); srand(time(0)); printf("%d ", i); @@ -270,7 +270,7 @@ static void locate_check(void) item_kill_tree(&tree); } -static void single_thread_tests(void) +static void single_thread_tests(bool long_run) { int i; @@ -285,9 +285,9 @@ static void single_thread_tests(void) printf("after add_and_check: %d allocated\n", nr_allocated); dynamic_height_check(); printf("after dynamic_height_check: %d allocated\n", nr_allocated); - big_gang_check(); + big_gang_check(long_run); printf("after big_gang_check: %d allocated\n", nr_allocated); - for (i = 0; i < 2000; i++) { + for (i = 0; i < (long_run ? 
2000 : 3); i++) { copy_tag_check(); printf("%d ", i); fflush(stdout); @@ -295,15 +295,23 @@ static void single_thread_tests(void) printf("after copy_tag_check: %d allocated\n", nr_allocated); } -int main(void) +int main(int argc, char **argv) { + bool long_run = false; + int opt; + + while ((opt = getopt(argc, argv, "l")) != -1) { + if (opt == 'l') + long_run = true; + } + rcu_register_thread(); radix_tree_init(); regression1_test(); regression2_test(); regression3_test(); - single_thread_tests(); + single_thread_tests(long_run); sleep(1); printf("after sleep(1): %d allocated\n", nr_allocated); -- cgit v0.10.2 From 7f308671c79899c1b4e275867d3647f64e896e78 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:01:48 -0700 Subject: radix tree test suite: rebuild when headers change When we make changes to radix-tree.h in the regular kernel source (include/linux/radix-tree.h), we really want our test code to be rebuilt. We also include a few other headers from tools/include and probably want to rebuild if these have been changed. Update the makefile so that all of our objects will be rebuilt when any of the headers we depend on are changed. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile index 604212d..43febba 100644 --- a/tools/testing/radix-tree/Makefile +++ b/tools/testing/radix-tree/Makefile @@ -13,7 +13,7 @@ main: $(OFILES) clean: $(RM) -f $(TARGETS) *.o radix-tree.c -$(OFILES): *.h */*.h +$(OFILES): *.h */*.h ../../../include/linux/radix-tree.h ../../include/linux/*.h radix-tree.c: ../../../lib/radix-tree.c sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@ -- cgit v0.10.2 From 6c4bd68a2962c03423a226d949caf64216d013cc Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:01:51 -0700 Subject: radix-tree: remove unused looping macros radix_tree_for_each_chunk() and radix_tree_for_each_chunk_slot() have never been used in the kernel since their introduction in 2012, so remove them. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 5ce5a1e..e1512a6 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -479,34 +479,6 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) } /** - * radix_tree_for_each_chunk - iterate over chunks - * - * @slot: the void** variable for pointer to chunk first slot - * @root: the struct radix_tree_root pointer - * @iter: the struct radix_tree_iter pointer - * @start: iteration starting index - * @flags: RADIX_TREE_ITER_* and tag index - * - * Locks can be released and reacquired between iterations. - */ -#define radix_tree_for_each_chunk(slot, root, iter, start, flags) \ - for (slot = radix_tree_iter_init(iter, start) ; \ - (slot = radix_tree_next_chunk(root, iter, flags)) ;) - -/** - * radix_tree_for_each_chunk_slot - iterate over slots in one chunk - * - * @slot: the void** variable, at the beginning points to chunk first slot - * @iter: the struct radix_tree_iter pointer - * @flags: RADIX_TREE_ITER_*, should be constant - * - * This macro is designed to be nested inside radix_tree_for_each_chunk(). 
- * @slot points to the radix tree slot, @iter->index contains its index. - */ -#define radix_tree_for_each_chunk_slot(slot, iter, flags) \ - for (; slot ; slot = radix_tree_next_slot(slot, iter, flags)) - -/** * radix_tree_for_each_slot - iterate over non-empty slots * * @slot: the void** variable for pointer to slot -- cgit v0.10.2 From 57578c2ea2cb2e0d362a9212ac83cf90221d4883 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:54 -0700 Subject: raxix-tree: introduce CONFIG_RADIX_TREE_MULTIORDER I've been receiving increasingly concerned notes from 0day about how much my recent changes have been bloating the radix tree. Make it happier by only including multiorder support if CONFIG_TRANSPARENT_HUGEPAGES is set. This is an independent Kconfig option, so other radix tree users can also set it if they have a need. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/Kconfig b/lib/Kconfig index 61d55bd..d79909d 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -362,6 +362,9 @@ config INTERVAL_TREE for more information. +config RADIX_TREE_MULTIORDER + bool + config ASSOCIATIVE_ARRAY bool help diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 1624c41..799f341 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -484,6 +484,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, slot = node->slots[offset]; } +#ifdef CONFIG_RADIX_TREE_MULTIORDER /* Insert pointers to the canonical entry */ if ((shift - order) > 0) { int i, n = 1 << (shift - order); @@ -499,6 +500,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, node->count++; } } +#endif if (nodep) *nodep = node; @@ -1469,6 +1471,20 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, return deleted; } +static inline void delete_sibling_entries(struct radix_tree_node *node, + void *ptr, unsigned offset) +{ +#ifdef CONFIG_RADIX_TREE_MULTIORDER + int i; + for (i = 1; offset + i < RADIX_TREE_MAP_SIZE; i++) { + if (node->slots[offset + i] != ptr) + break; + node->slots[offset + i] = NULL; + node->count--; + } +#endif +} + /** * radix_tree_delete_item - delete an item from a radix tree * @root: radix tree root @@ -1484,7 +1500,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root, unsigned long index, void *item) { struct radix_tree_node *node; - unsigned int offset, i; + unsigned int offset; void **slot; void *entry; int tag; @@ -1513,13 +1529,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root, radix_tree_tag_clear(root, index, tag); } - /* Delete any sibling slots pointing to this slot */ - for (i = 1; offset + i < RADIX_TREE_MAP_SIZE; i++) { - if (node->slots[offset + i] != ptr_to_indirect(slot)) - break; - node->slots[offset + i] = NULL; - node->count--; - } + delete_sibling_entries(node, ptr_to_indirect(slot), offset); node->slots[offset] = NULL; node->count--; diff --git a/mm/Kconfig b/mm/Kconfig index 1a6a28e..2664c11 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -404,6 +404,7 @@ config TRANSPARENT_HUGEPAGE bool "Transparent Hugepage Support" depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE select COMPACTION + select RADIX_TREE_MULTIORDER help Transparent Hugepages allows the kernel to use huge pages and huge tlb transparently to the applications whenever possible. 
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index 31fe2c77..8ea0ed4 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -9,6 +9,7 @@ #include "../../include/linux/compiler.h" +#define CONFIG_RADIX_TREE_MULTIORDER #define CONFIG_SHMEM #define CONFIG_SWAP -- cgit v0.10.2 From db050f2924fcf39428bdadf28970a32cfaf256ef Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:57 -0700 Subject: radix-tree: add missing sibling entry functionality The code I previously added to enable multiorder radix tree entries was untested and therefore buggy. This commit adds the support functions that Ross and I decided were necessary over a four-week period of iterating various designs. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 799f341..585965a 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -80,6 +80,46 @@ static inline void *indirect_to_ptr(void *ptr) return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR); } +#ifdef CONFIG_RADIX_TREE_MULTIORDER +/* Sibling slots point directly to another slot in the same node */ +static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node) +{ + void **ptr = node; + return (parent->slots <= ptr) && + (ptr < parent->slots + RADIX_TREE_MAP_SIZE); +} +#else +static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node) +{ + return false; +} +#endif + +static inline unsigned long get_slot_offset(struct radix_tree_node *parent, + void **slot) +{ + return slot - parent->slots; +} + +static unsigned radix_tree_descend(struct radix_tree_node *parent, + struct radix_tree_node **nodep, unsigned offset) +{ + void **entry = rcu_dereference_raw(parent->slots[offset]); + +#ifdef CONFIG_RADIX_TREE_MULTIORDER + if (radix_tree_is_indirect_ptr(entry)) { + unsigned long siboff = get_slot_offset(parent, entry); + if (siboff < RADIX_TREE_MAP_SIZE) { + offset = siboff; + entry = rcu_dereference_raw(parent->slots[offset]); + } + } +#endif + + *nodep = (void *)entry; + return offset; +} + static inline gfp_t root_gfp_mask(struct radix_tree_root *root) { return root->gfp_mask & __GFP_BITS_MASK; -- cgit v0.10.2 From 3b8c00f68405e9c037a6d8ae0d5d9da7f8a34e6a Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:01:59 -0700 Subject: radix-tree: fix sibling entry insertion The subtraction was the wrong way round, leading to undefined behaviour (shift by an amount larger than the size of the type). 
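A worked example with illustrative numbers: with RADIX_TREE_MAP_SHIFT == 6, an order-9 entry stored at a level whose slots each cover 2^6 indices has to occupy 1 << (9 - 6) = 8 slots, i.e. the canonical entry plus 7 sibling pointers. With the operands swapped, the expression becomes 1 << (6 - 9), a shift by a negative (or, for deeper trees, an over-wide) amount, which C leaves undefined.

	n = 1 << (order - shift);	/* order == 9, shift == 6  =>  n == 8 */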
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 585965a..c0366d1 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -526,8 +526,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, #ifdef CONFIG_RADIX_TREE_MULTIORDER /* Insert pointers to the canonical entry */ - if ((shift - order) > 0) { - int i, n = 1 << (shift - order); + if (order > shift) { + int i, n = 1 << (order - shift); offset = offset & ~(n - 1); slot = ptr_to_indirect(&node->slots[offset]); for (i = 0; i < n; i++) { -- cgit v0.10.2 From 29e0967c2f66771f654cef7168c90a53737abcdf Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:02 -0700 Subject: radix-tree: fix deleting a multi-order entry through an alias If we deleted an entry through an index which looked up a sibling pointer, we'd end up zeroing out the wrong slots in the node. Use get_slot_offset() to find the right slot. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index c0366d1..b3364b9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1558,7 +1558,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root, return entry; } - offset = index & RADIX_TREE_MAP_MASK; + offset = get_slot_offset(node, slot); /* * Clear all tags associated with the item to be deleted. -- cgit v0.10.2 From aa5475760235672f316fbf29cdfb82a75016dbdf Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:05 -0700 Subject: radix-tree: remove restriction on multi-order entries Now that sibling pointers are handled explicitly, there is no purpose served by restricting the order to be >= RADIX_TREE_MAP_SHIFT. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index b3364b9..6900f7b 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -483,8 +483,6 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, unsigned int height, shift, offset; int error; - BUG_ON((0 < order) && (order < RADIX_TREE_MAP_SHIFT)); - /* Make sure the tree is high enough. */ if (index > radix_tree_maxindex(root->height)) { error = radix_tree_extend(root, index, order); -- cgit v0.10.2 From 1456a439fc2dcc0c3d1a2d7af1fd83962813aaee Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:08 -0700 Subject: radix-tree: introduce radix_tree_load_root() All the tree walking functions start with some variant of this code; centralise it in one place so we're not chasing subtly different bugs everywhere. 
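A minimal sketch of how a tree walker is expected to use the new helper (illustrative; root and index are assumed to be in scope, and the actual conversion of __radix_tree_lookup() follows in a later patch in this series):

	struct radix_tree_node *node;
	unsigned long maxindex;
	unsigned shift = radix_tree_load_root(root, &node, &maxindex);

	if (index > maxindex)
		return NULL;		/* index lies beyond the current tree */
	/* ... otherwise descend from node, starting at shift ... */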
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 6900f7b..272ce81 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -405,6 +405,29 @@ static inline unsigned long radix_tree_maxindex(unsigned int height) return height_to_maxindex[height]; } +static inline unsigned long node_maxindex(struct radix_tree_node *node) +{ + return radix_tree_maxindex(node->path & RADIX_TREE_HEIGHT_MASK); +} + +static unsigned radix_tree_load_root(struct radix_tree_root *root, + struct radix_tree_node **nodep, unsigned long *maxindex) +{ + struct radix_tree_node *node = rcu_dereference_raw(root->rnode); + + *nodep = node; + + if (likely(radix_tree_is_indirect_ptr(node))) { + node = indirect_to_ptr(node); + *maxindex = node_maxindex(node); + return (node->path & RADIX_TREE_HEIGHT_MASK) * + RADIX_TREE_MAP_SHIFT; + } + + *maxindex = 0; + return 0; +} + /* * Extend a radix tree so it can store key @index. */ -- cgit v0.10.2 From 49ea6ebcd3080ebf2c589b5f1437dd8bb2f90395 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:11 -0700 Subject: radix-tree: fix extending the tree for multi-order entries at offset 0 The current code will insert entries at each level, as if we're going to add a new entry at the bottom level, so we then get an -EEXIST when we try to insert the entry into the tree. The best way to fix this is to not check 'order' when inserting into an empty tree. We still need to 'extend' the tree to the height necessary for the maximum index corresponding to this entry, so pass that value to radix_tree_extend() rather than the index we're asked to create, or we won't create a tree that's deep enough. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 272ce81..f13ddbb 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -432,7 +432,7 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, * Extend a radix tree so it can store key @index. 
*/ static int radix_tree_extend(struct radix_tree_root *root, - unsigned long index, unsigned order) + unsigned long index) { struct radix_tree_node *node; struct radix_tree_node *slot; @@ -444,7 +444,7 @@ static int radix_tree_extend(struct radix_tree_root *root, while (index > radix_tree_maxindex(height)) height++; - if ((root->rnode == NULL) && (order == 0)) { + if (root->rnode == NULL) { root->height = height; goto out; } @@ -467,7 +467,7 @@ static int radix_tree_extend(struct radix_tree_root *root, node->count = 1; node->parent = NULL; slot = root->rnode; - if (radix_tree_is_indirect_ptr(slot) && newheight > 1) { + if (radix_tree_is_indirect_ptr(slot)) { slot = indirect_to_ptr(slot); slot->parent = node; slot = ptr_to_indirect(slot); @@ -478,7 +478,7 @@ static int radix_tree_extend(struct radix_tree_root *root, root->height = newheight; } while (height > root->height); out: - return 0; + return height * RADIX_TREE_MAP_SHIFT; } /** @@ -503,20 +503,26 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, void ***slotp) { struct radix_tree_node *node = NULL, *slot; + unsigned long maxindex; unsigned int height, shift, offset; - int error; + unsigned long max = index | ((1UL << order) - 1); + + shift = radix_tree_load_root(root, &slot, &maxindex); /* Make sure the tree is high enough. */ - if (index > radix_tree_maxindex(root->height)) { - error = radix_tree_extend(root, index, order); - if (error) + if (max > maxindex) { + int error = radix_tree_extend(root, max); + if (error < 0) return error; + shift = error; + slot = root->rnode; + if (order == shift) { + shift += RADIX_TREE_MAP_SHIFT; + root->height++; + } } - slot = root->rnode; - height = root->height; - shift = height * RADIX_TREE_MAP_SHIFT; offset = 0; /* uninitialised var warning */ while (shift > order) { -- cgit v0.10.2 From 4f3755d1ae3cd856a5c7da3dea12cced8dc51fbf Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:14 -0700 Subject: radix tree test suite: start adding multiorder tests Test suite infrastructure for working with multiorder entries. The test itself is pretty basic: Add an entry, check that all expected indices return that entry and that indices around that entry don't return an entry. Then delete the entry and check no index returns that entry. Tests a few edge conditions including the multiorder entry at index 0 and at a higher index. Also tests deleting through an alias as well as through the canonical index. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile index 43febba..3b53046 100644 --- a/tools/testing/radix-tree/Makefile +++ b/tools/testing/radix-tree/Makefile @@ -3,7 +3,7 @@ CFLAGS += -I. 
-g -Wall -D_LGPL_SOURCE LDFLAGS += -lpthread -lurcu TARGETS = main OFILES = main.o radix-tree.o linux.o test.o tag_check.o find_next_bit.o \ - regression1.o regression2.o regression3.o + regression1.o regression2.o regression3.o multiorder.o targets: $(TARGETS) diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index 122c8b9..b6a700b 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -275,6 +275,8 @@ static void single_thread_tests(bool long_run) int i; printf("starting single_thread_tests: %d allocated\n", nr_allocated); + multiorder_checks(); + printf("after multiorder_check: %d allocated\n", nr_allocated); locate_check(); printf("after locate_check: %d allocated\n", nr_allocated); tag_check(); diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c new file mode 100644 index 0000000..cfe718c --- /dev/null +++ b/tools/testing/radix-tree/multiorder.c @@ -0,0 +1,58 @@ +/* + * multiorder.c: Multi-order radix tree entry testing + * Copyright (c) 2016 Intel Corporation + * Author: Ross Zwisler + * Author: Matthew Wilcox + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ +#include +#include +#include + +#include "test.h" + +static void multiorder_check(unsigned long index, int order) +{ + unsigned long i; + unsigned long min = index & ~((1UL << order) - 1); + unsigned long max = min + (1UL << order); + RADIX_TREE(tree, GFP_KERNEL); + + printf("Multiorder index %ld, order %d\n", index, order); + + assert(item_insert_order(&tree, index, order) == 0); + + for (i = min; i < max; i++) { + struct item *item = item_lookup(&tree, i); + assert(item != 0); + assert(item->index == index); + } + for (i = 0; i < min; i++) + item_check_absent(&tree, i); + for (i = max; i < 2*max; i++) + item_check_absent(&tree, i); + + assert(item_delete(&tree, index) != 0); + + for (i = 0; i < 2*max; i++) + item_check_absent(&tree, i); +} + +void multiorder_checks(void) +{ + int i; + + for (i = 0; i < 20; i++) { + multiorder_check(200, i); + multiorder_check(0, i); + multiorder_check((1UL << i) + 1, i); + } +} diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c index 2bebf34..da54f11 100644 --- a/tools/testing/radix-tree/test.c +++ b/tools/testing/radix-tree/test.c @@ -24,14 +24,21 @@ int item_tag_get(struct radix_tree_root *root, unsigned long index, int tag) return radix_tree_tag_get(root, index, tag); } -int __item_insert(struct radix_tree_root *root, struct item *item) +int __item_insert(struct radix_tree_root *root, struct item *item, + unsigned order) { - return radix_tree_insert(root, item->index, item); + return __radix_tree_insert(root, item->index, order, item); } int item_insert(struct radix_tree_root *root, unsigned long index) { - return __item_insert(root, item_create(index)); + return __item_insert(root, item_create(index), 0); +} + +int item_insert_order(struct radix_tree_root *root, unsigned long index, + unsigned order) +{ + return __item_insert(root, item_create(index), order); } int item_delete(struct radix_tree_root *root, unsigned long index) diff --git 
a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h index 4e1d95f..53cb595 100644 --- a/tools/testing/radix-tree/test.h +++ b/tools/testing/radix-tree/test.h @@ -8,8 +8,11 @@ struct item { }; struct item *item_create(unsigned long index); -int __item_insert(struct radix_tree_root *root, struct item *item); +int __item_insert(struct radix_tree_root *root, struct item *item, + unsigned order); int item_insert(struct radix_tree_root *root, unsigned long index); +int item_insert_order(struct radix_tree_root *root, unsigned long index, + unsigned order); int item_delete(struct radix_tree_root *root, unsigned long index); struct item *item_lookup(struct radix_tree_root *root, unsigned long index); @@ -23,6 +26,7 @@ void item_full_scan(struct radix_tree_root *root, unsigned long start, void item_kill_tree(struct radix_tree_root *root); void tag_check(void); +void multiorder_checks(void); struct item * item_tag_set(struct radix_tree_root *root, unsigned long index, int tag); -- cgit v0.10.2 From afe0e395b6d1817fa5393f1ad6fcbf71406b016d Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:17 -0700 Subject: radix-tree: fix several shrinking bugs with multiorder entries Setting the indirect bit on the user data entry used to be unambiguous because the tree walking code knew not to expect internal nodes in the last level of the tree. Multiorder entries can appear at any level of the tree, and a leaf with the indirect bit set is indistinguishable from a pointer to a node. Introduce a special entry (RADIX_TREE_RETRY) which is neither a valid user entry, nor a valid pointer to a node. The radix_tree_deref_retry() function continues to work the same way, but tree walking code can distinguish it from a pointer to a node. Also fix the condition for setting slot->parent to NULL; it does not matter what height the tree is, it only matters whether slot is an indirect pointer. Move this code above the comment which is referring to the assignment to root->rnode. Also fix the condition for preventing the tree from shrinking to a single entry if it's a multiorder entry. Add a test-case to the test suite that checks that the tree goes back down to its original height after an item is inserted & deleted from a higher index in the tree. 
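For context, the reader side copes with such a shrink by re-reading the slot when it finds the retry entry. A minimal sketch of that pattern, modelled on the existing find_get_entry() lookup in mm/filemap.c (mapping and index are assumed context, not introduced by this patch):

	void **slot;
	struct page *page;

	rcu_read_lock();
repeat:
	slot = radix_tree_lookup_slot(&mapping->page_tree, index);
	page = slot ? radix_tree_deref_slot(slot) : NULL;
	if (radix_tree_deref_retry(page))
		goto repeat;		/* slot went stale under a concurrent shrink */
	rcu_read_unlock();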
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index f13ddbb..a1ba417 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -80,6 +80,8 @@ static inline void *indirect_to_ptr(void *ptr) return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR); } +#define RADIX_TREE_RETRY ptr_to_indirect(NULL) + #ifdef CONFIG_RADIX_TREE_MULTIORDER /* Sibling slots point directly to another slot in the same node */ static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node) @@ -1443,6 +1445,14 @@ static inline void radix_tree_shrink(struct radix_tree_root *root) slot = to_free->slots[0]; if (!slot) break; + if (!radix_tree_is_indirect_ptr(slot) && (root->height > 1)) + break; + + if (radix_tree_is_indirect_ptr(slot)) { + slot = indirect_to_ptr(slot); + slot->parent = NULL; + slot = ptr_to_indirect(slot); + } /* * We don't need rcu_assign_pointer(), since we are simply @@ -1451,14 +1461,6 @@ static inline void radix_tree_shrink(struct radix_tree_root *root) * (to_free->slots[0]), it will be safe to dereference the new * one (root->rnode) as far as dependent read barriers go. */ - if (root->height > 1) { - if (!radix_tree_is_indirect_ptr(slot)) - break; - - slot = indirect_to_ptr(slot); - slot->parent = NULL; - slot = ptr_to_indirect(slot); - } root->rnode = slot; root->height--; @@ -1480,9 +1482,8 @@ static inline void radix_tree_shrink(struct radix_tree_root *root) * also results in a stale slot). So tag the slot as indirect * to force callers to retry. */ - if (root->height == 0) - *((unsigned long *)&to_free->slots[0]) |= - RADIX_TREE_INDIRECT_PTR; + if (!radix_tree_is_indirect_ptr(slot)) + to_free->slots[0] = RADIX_TREE_RETRY; radix_tree_node_free(to_free); } diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index cfe718c..71f34a0 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -46,6 +46,41 @@ static void multiorder_check(unsigned long index, int order) item_check_absent(&tree, i); } +static void multiorder_shrink(unsigned long index, int order) +{ + unsigned long i; + unsigned long max = 1 << order; + RADIX_TREE(tree, GFP_KERNEL); + struct radix_tree_node *node; + + printf("Multiorder shrink index %ld, order %d\n", index, order); + + assert(item_insert_order(&tree, 0, order) == 0); + + node = tree.rnode; + + assert(item_insert(&tree, index) == 0); + assert(node != tree.rnode); + + assert(item_delete(&tree, index) != 0); + assert(node == tree.rnode); + + for (i = 0; i < max; i++) { + struct item *item = item_lookup(&tree, i); + assert(item != 0); + assert(item->index == 0); + } + for (i = max; i < 2*max; i++) + item_check_absent(&tree, i); + + if (!item_delete(&tree, 0)) { + printf("failed to delete index %ld (order %d)\n", index, order); abort(); + } + + for (i = 0; i < 2*max; i++) + item_check_absent(&tree, i); +} + void multiorder_checks(void) { int i; @@ -55,4 +90,8 @@ void multiorder_checks(void) multiorder_check(0, i); multiorder_check((1UL << i) + 1, i); } + + for (i = 0; i < 15; i++) + multiorder_shrink((1UL << (i + RADIX_TREE_MAP_SHIFT)), i); + } -- cgit v0.10.2 From 858299544efcf2577511bb018b9e73637ac71a2f Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:20 -0700 Subject: radix-tree: rewrite __radix_tree_lookup Use the new multi-order support functions to rewrite 
__radix_tree_lookup() Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index a1ba417..f14ada9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -634,44 +634,28 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index, struct radix_tree_node **nodep, void ***slotp) { struct radix_tree_node *node, *parent; - unsigned int height, shift; + unsigned long maxindex; + unsigned int shift; void **slot; - node = rcu_dereference_raw(root->rnode); - if (node == NULL) - return NULL; - - if (!radix_tree_is_indirect_ptr(node)) { - if (index > 0) - return NULL; - - if (nodep) - *nodep = NULL; - if (slotp) - *slotp = (void **)&root->rnode; - return node; - } - node = indirect_to_ptr(node); - - height = node->path & RADIX_TREE_HEIGHT_MASK; - if (index > radix_tree_maxindex(height)) + restart: + parent = NULL; + slot = (void **)&root->rnode; + shift = radix_tree_load_root(root, &node, &maxindex); + if (index > maxindex) return NULL; - shift = (height-1) * RADIX_TREE_MAP_SHIFT; - - do { - parent = node; - slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK); - node = rcu_dereference_raw(*slot); - if (node == NULL) - return NULL; - if (!radix_tree_is_indirect_ptr(node)) - break; - node = indirect_to_ptr(node); + while (radix_tree_is_indirect_ptr(node)) { + unsigned offset; + if (node == RADIX_TREE_RETRY) + goto restart; + parent = indirect_to_ptr(node); shift -= RADIX_TREE_MAP_SHIFT; - height--; - } while (height > 0); + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = radix_tree_descend(parent, &node, offset); + slot = parent->slots + offset; + } if (nodep) *nodep = parent; -- cgit v0.10.2 From 7b60e9ad59a31dd98c2f7ef841e2882c2b0e0f3b Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:23 -0700 Subject: radix-tree: fix multiorder BUG_ON in radix_tree_insert These BUG_ON tests are to ensure that all the tags are clear when inserting a new entry. If we insert a multiorder entry, we'll end up looking at the tags for a different node, and so the BUG_ON can end up triggering spuriously. Also, we now have three tags, not two, so check all three are clear, and check all the root tags with a single call to BUG_ON since the bits are stored contiguously. Include a test-case to ensure this problem does not reoccur. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index f14ada9..ff46042 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -165,6 +165,11 @@ static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag) return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT)); } +static inline unsigned root_tags_get(struct radix_tree_root *root) +{ + return (__force unsigned)root->gfp_mask >> __GFP_BITS_SHIFT; +} + /* * Returns 1 if any slot in the node has this tag set. * Otherwise returns 0. 
@@ -604,12 +609,13 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index, rcu_assign_pointer(*slot, item); if (node) { + unsigned offset = get_slot_offset(node, slot); node->count++; - BUG_ON(tag_get(node, 0, index & RADIX_TREE_MAP_MASK)); - BUG_ON(tag_get(node, 1, index & RADIX_TREE_MAP_MASK)); + BUG_ON(tag_get(node, 0, offset)); + BUG_ON(tag_get(node, 1, offset)); + BUG_ON(tag_get(node, 2, offset)); } else { - BUG_ON(root_tag_get(root, 0)); - BUG_ON(root_tag_get(root, 1)); + BUG_ON(root_tags_get(root)); } return 0; diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index 71f34a0..0a311a5 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -81,6 +81,17 @@ static void multiorder_shrink(unsigned long index, int order) item_check_absent(&tree, i); } +static void multiorder_insert_bug(void) +{ + RADIX_TREE(tree, GFP_KERNEL); + + item_insert(&tree, 0); + radix_tree_tag_set(&tree, 0, 0); + item_insert_order(&tree, 3 << 6, 6); + + item_kill_tree(&tree); +} + void multiorder_checks(void) { int i; @@ -94,4 +105,5 @@ void multiorder_checks(void) for (i = 0; i < 15; i++) multiorder_shrink((1UL << (i + RADIX_TREE_MAP_SHIFT)), i); + multiorder_insert_bug(); } -- cgit v0.10.2 From 21ef533931f73a8e963a6107aa5ec51b192f28be Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:26 -0700 Subject: radix-tree: add support for multi-order iterating This enables the macros radix_tree_for_each_slot() and friends to be used with multi-order entries. The way that this works is that we treat all entries in a given slots[] array as a single chunk. If the index given to radix_tree_next_chunk() happens to point us to a sibling entry, we will back up iter->index so that it points to the canonical entry, and that will be the place where we start our iteration. As we're processing a chunk in radix_tree_next_slot(), we process canonical entries, skip over sibling entries, and restart the chunk lookup if we find a non-sibling indirect pointer. This drops back to the radix_tree_next_chunk() code, which will re-walk the tree and look for another chunk. This allows us to properly handle multi-order entries mixed with other entries that are at various heights in the radix tree. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index e1512a6..8558d52 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -330,8 +330,9 @@ static inline void radix_tree_preload_end(void) * struct radix_tree_iter - radix tree iterator state * * @index: index of current slot - * @next_index: next-to-last index for this chunk + * @next_index: one beyond the last index for this chunk * @tags: bit-mask for tag-iterating + * @shift: shift for the node that holds our slots * * This radix tree iterator works in terms of "chunks" of slots. A chunk is a * subinterval of slots contained within one radix tree leaf node. 
It is @@ -344,8 +345,20 @@ struct radix_tree_iter { unsigned long index; unsigned long next_index; unsigned long tags; +#ifdef CONFIG_RADIX_TREE_MULTIORDER + unsigned int shift; +#endif }; +static inline unsigned int iter_shift(struct radix_tree_iter *iter) +{ +#ifdef CONFIG_RADIX_TREE_MULTIORDER + return iter->shift; +#else + return 0; +#endif +} + #define RADIX_TREE_ITER_TAG_MASK 0x00FF /* tag index in lower byte */ #define RADIX_TREE_ITER_TAGGED 0x0100 /* lookup tagged slots */ #define RADIX_TREE_ITER_CONTIG 0x0200 /* stop at first hole */ @@ -405,6 +418,12 @@ void **radix_tree_iter_retry(struct radix_tree_iter *iter) return NULL; } +static inline unsigned long +__radix_tree_iter_add(struct radix_tree_iter *iter, unsigned long slots) +{ + return iter->index + (slots << iter_shift(iter)); +} + /** * radix_tree_iter_next - resume iterating when the chunk may be invalid * @iter: iterator state @@ -416,7 +435,7 @@ void **radix_tree_iter_retry(struct radix_tree_iter *iter) static inline __must_check void **radix_tree_iter_next(struct radix_tree_iter *iter) { - iter->next_index = iter->index + 1; + iter->next_index = __radix_tree_iter_add(iter, 1); iter->tags = 0; return NULL; } @@ -430,7 +449,12 @@ void **radix_tree_iter_next(struct radix_tree_iter *iter) static __always_inline long radix_tree_chunk_size(struct radix_tree_iter *iter) { - return iter->next_index - iter->index; + return (iter->next_index - iter->index) >> iter_shift(iter); +} + +static inline void *indirect_to_ptr(void *ptr) +{ + return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR); } /** @@ -448,24 +472,51 @@ static __always_inline void ** radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) { if (flags & RADIX_TREE_ITER_TAGGED) { + void *canon = slot; + iter->tags >>= 1; + if (unlikely(!iter->tags)) + return NULL; + while (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && + radix_tree_is_indirect_ptr(slot[1])) { + if (indirect_to_ptr(slot[1]) == canon) { + iter->tags >>= 1; + iter->index = __radix_tree_iter_add(iter, 1); + slot++; + continue; + } + iter->next_index = __radix_tree_iter_add(iter, 1); + return NULL; + } if (likely(iter->tags & 1ul)) { - iter->index++; + iter->index = __radix_tree_iter_add(iter, 1); return slot + 1; } - if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) { + if (!(flags & RADIX_TREE_ITER_CONTIG)) { unsigned offset = __ffs(iter->tags); iter->tags >>= offset; - iter->index += offset + 1; + iter->index = __radix_tree_iter_add(iter, offset + 1); return slot + offset + 1; } } else { - long size = radix_tree_chunk_size(iter); + long count = radix_tree_chunk_size(iter); + void *canon = slot; - while (--size > 0) { + while (--count > 0) { slot++; - iter->index++; + iter->index = __radix_tree_iter_add(iter, 1); + + if (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && + radix_tree_is_indirect_ptr(*slot)) { + if (indirect_to_ptr(*slot) == canon) + continue; + else { + iter->next_index = iter->index; + break; + } + } + if (likely(*slot)) return slot; if (flags & RADIX_TREE_ITER_CONTIG) { diff --git a/lib/radix-tree.c b/lib/radix-tree.c index ff46042..a4da86e 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -75,11 +75,6 @@ static inline void *ptr_to_indirect(void *ptr) return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR); } -static inline void *indirect_to_ptr(void *ptr) -{ - return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR); -} - #define RADIX_TREE_RETRY ptr_to_indirect(NULL) #ifdef CONFIG_RADIX_TREE_MULTIORDER @@ -885,6 +880,14 @@ int 
radix_tree_tag_get(struct radix_tree_root *root, } EXPORT_SYMBOL(radix_tree_tag_get); +static inline void __set_iter_shift(struct radix_tree_iter *iter, + unsigned int shift) +{ +#ifdef CONFIG_RADIX_TREE_MULTIORDER + iter->shift = shift; +#endif +} + /** * radix_tree_next_chunk - find next chunk of slots for iteration * @@ -898,7 +901,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, { unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK; struct radix_tree_node *rnode, *node; - unsigned long index, offset, height; + unsigned long index, offset, maxindex; if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag)) return NULL; @@ -916,33 +919,39 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, if (!index && iter->index) return NULL; - rnode = rcu_dereference_raw(root->rnode); + restart: + shift = radix_tree_load_root(root, &rnode, &maxindex); + if (index > maxindex) + return NULL; + if (radix_tree_is_indirect_ptr(rnode)) { rnode = indirect_to_ptr(rnode); - } else if (rnode && !index) { + } else if (rnode) { /* Single-slot tree */ - iter->index = 0; - iter->next_index = 1; + iter->index = index; + iter->next_index = maxindex + 1; iter->tags = 1; + __set_iter_shift(iter, shift); return (void **)&root->rnode; } else return NULL; -restart: - height = rnode->path & RADIX_TREE_HEIGHT_MASK; - shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + shift -= RADIX_TREE_MAP_SHIFT; offset = index >> shift; - /* Index outside of the tree */ - if (offset >= RADIX_TREE_MAP_SIZE) - return NULL; - node = rnode; while (1) { struct radix_tree_node *slot; + unsigned new_off = radix_tree_descend(node, &slot, offset); + + if (new_off < offset) { + offset = new_off; + index &= ~((RADIX_TREE_MAP_SIZE << shift) - 1); + index |= offset << shift; + } + if ((flags & RADIX_TREE_ITER_TAGGED) ? 
- !test_bit(offset, node->tags[tag]) : - !node->slots[offset]) { + !tag_get(node, tag, offset) : !slot) { /* Hole detected */ if (flags & RADIX_TREE_ITER_CONTIG) return NULL; @@ -954,7 +963,10 @@ restart: offset + 1); else while (++offset < RADIX_TREE_MAP_SIZE) { - if (node->slots[offset]) + void *slot = node->slots[offset]; + if (is_sibling_entry(node, slot)) + continue; + if (slot) break; } index &= ~((RADIX_TREE_MAP_SIZE << shift) - 1); @@ -964,25 +976,23 @@ restart: return NULL; if (offset == RADIX_TREE_MAP_SIZE) goto restart; + slot = rcu_dereference_raw(node->slots[offset]); } - /* This is leaf-node */ - if (!shift) - break; - - slot = rcu_dereference_raw(node->slots[offset]); - if (slot == NULL) + if ((slot == NULL) || (slot == RADIX_TREE_RETRY)) goto restart; if (!radix_tree_is_indirect_ptr(slot)) break; + node = indirect_to_ptr(slot); shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; } /* Update the iterator state */ - iter->index = index; - iter->next_index = (index | RADIX_TREE_MAP_MASK) + 1; + iter->index = index & ~((1 << shift) - 1); + iter->next_index = (index | ((RADIX_TREE_MAP_SIZE << shift) - 1)) + 1; + __set_iter_shift(iter, shift); /* Construct iter->tags bit-mask from node->tags[tag] array */ if (flags & RADIX_TREE_ITER_TAGGED) { diff --git a/tools/testing/radix-tree/generated/autoconf.h b/tools/testing/radix-tree/generated/autoconf.h new file mode 100644 index 0000000..ad18cf5 --- /dev/null +++ b/tools/testing/radix-tree/generated/autoconf.h @@ -0,0 +1,3 @@ +#define CONFIG_RADIX_TREE_MULTIORDER 1 +#define CONFIG_SHMEM 1 +#define CONFIG_SWAP 1 diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h index 8ea0ed4..be98a47 100644 --- a/tools/testing/radix-tree/linux/kernel.h +++ b/tools/testing/radix-tree/linux/kernel.h @@ -8,10 +8,7 @@ #include #include "../../include/linux/compiler.h" - -#define CONFIG_RADIX_TREE_MULTIORDER -#define CONFIG_SHMEM -#define CONFIG_SWAP +#include "../../../include/linux/kconfig.h" #define RADIX_TREE_MAP_SHIFT 3 -- cgit v0.10.2 From 643b57d0a9bd4c93625a2f5da4cebc3ceb402b9b Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:29 -0700 Subject: radix tree test suite: multi-order iteration test Add a unit test to verify that we can iterate over multi-order entries properly via a radix_tree_for_each_slot() loop. This was done with a single, somewhat complicated configuration that was meant to test many of the various corner cases having to do with multi-order entries: - An iteration could begin at a sibling entry, and we need to return the canonical entry. - We could have entries of various orders in the same slots[] array. - We could have multi-order entries at a nonzero height, followed by indirect pointers to more radix tree nodes later in that same slots[] array. 
Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index 0a311a5..ba27fe0 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -92,6 +92,96 @@ static void multiorder_insert_bug(void) item_kill_tree(&tree); } +void multiorder_iteration(void) +{ + RADIX_TREE(tree, GFP_KERNEL); + struct radix_tree_iter iter; + void **slot; + int i, err; + + printf("Multiorder iteration test\n"); + +#define NUM_ENTRIES 11 + int index[NUM_ENTRIES] = {0, 2, 4, 8, 16, 32, 34, 36, 64, 72, 128}; + int order[NUM_ENTRIES] = {1, 1, 2, 3, 4, 1, 0, 1, 3, 0, 7}; + + for (i = 0; i < NUM_ENTRIES; i++) { + err = item_insert_order(&tree, index[i], order[i]); + assert(!err); + } + + i = 0; + /* start from index 1 to verify we find the multi-order entry at 0 */ + radix_tree_for_each_slot(slot, &tree, &iter, 1) { + int height = order[i] / RADIX_TREE_MAP_SHIFT; + int shift = height * RADIX_TREE_MAP_SHIFT; + + assert(iter.index == index[i]); + assert(iter.shift == shift); + i++; + } + + /* + * Now iterate through the tree starting at an elevated multi-order + * entry, beginning at an index in the middle of the range. + */ + i = 8; + radix_tree_for_each_slot(slot, &tree, &iter, 70) { + int height = order[i] / RADIX_TREE_MAP_SHIFT; + int shift = height * RADIX_TREE_MAP_SHIFT; + + assert(iter.index == index[i]); + assert(iter.shift == shift); + i++; + } + + item_kill_tree(&tree); +} + +void multiorder_tagged_iteration(void) +{ + RADIX_TREE(tree, GFP_KERNEL); + struct radix_tree_iter iter; + void **slot; + int i; + + printf("Multiorder tagged iteration test\n"); + +#define MT_NUM_ENTRIES 9 + int index[MT_NUM_ENTRIES] = {0, 2, 4, 16, 32, 40, 64, 72, 128}; + int order[MT_NUM_ENTRIES] = {1, 0, 2, 4, 3, 1, 3, 0, 7}; + +#define TAG_ENTRIES 7 + int tag_index[TAG_ENTRIES] = {0, 4, 16, 40, 64, 72, 128}; + + for (i = 0; i < MT_NUM_ENTRIES; i++) + assert(!item_insert_order(&tree, index[i], order[i])); + + assert(!radix_tree_tagged(&tree, 1)); + + for (i = 0; i < TAG_ENTRIES; i++) + assert(radix_tree_tag_set(&tree, tag_index[i], 1)); + + i = 0; + /* start from index 1 to verify we find the multi-order entry at 0 */ + radix_tree_for_each_tagged(slot, &tree, &iter, 1, 1) { + assert(iter.index == tag_index[i]); + i++; + } + + /* + * Now iterate through the tree starting at an elevated multi-order + * entry, beginning at an index in the middle of the range. 
+ */ + i = 4; + radix_tree_for_each_slot(slot, &tree, &iter, 70) { + assert(iter.index == tag_index[i]); + i++; + } + + item_kill_tree(&tree); +} + void multiorder_checks(void) { int i; @@ -106,4 +196,6 @@ void multiorder_checks(void) multiorder_shrink((1UL << (i + RADIX_TREE_MAP_SHIFT)), i); multiorder_insert_bug(); + multiorder_iteration(); + multiorder_tagged_iteration(); } -- cgit v0.10.2 From fb969909dd18930ad68d8f8b8e7895816cf74b0a Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:32 -0700 Subject: radix-tree: rewrite radix_tree_tag_set Use the new multi-order support functions to rewrite radix_tree_tag_set() Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index a4da86e..5234a95 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -722,35 +722,32 @@ EXPORT_SYMBOL(radix_tree_lookup); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag) { - unsigned int height, shift; - struct radix_tree_node *slot; - - height = root->height; - BUG_ON(index > radix_tree_maxindex(height)); + struct radix_tree_node *node, *parent; + unsigned long maxindex; + unsigned int shift; - slot = indirect_to_ptr(root->rnode); - shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + shift = radix_tree_load_root(root, &node, &maxindex); + BUG_ON(index > maxindex); - while (height > 0) { - int offset; + while (radix_tree_is_indirect_ptr(node)) { + unsigned offset; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - if (!tag_get(slot, tag, offset)) - tag_set(slot, tag, offset); - slot = slot->slots[offset]; - BUG_ON(slot == NULL); - if (!radix_tree_is_indirect_ptr(slot)) - break; - slot = indirect_to_ptr(slot); shift -= RADIX_TREE_MAP_SHIFT; - height--; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + + parent = indirect_to_ptr(node); + offset = radix_tree_descend(parent, &node, offset); + BUG_ON(!node); + + if (!tag_get(parent, tag, offset)) + tag_set(parent, tag, offset); } /* set the root's tag bit */ - if (slot && !root_tag_get(root, tag)) + if (!root_tag_get(root, tag)) root_tag_set(root, tag); - return slot; + return node; } EXPORT_SYMBOL(radix_tree_tag_set); -- cgit v0.10.2 From 00f47b581105b91f8f18edd3074322ae609a8bc5 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:35 -0700 Subject: radix-tree: rewrite radix_tree_tag_clear Use the new multi-order support functions to rewrite radix_tree_tag_clear() Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 5234a95..ee56562 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -768,44 +768,40 @@ EXPORT_SYMBOL(radix_tree_tag_set); void *radix_tree_tag_clear(struct radix_tree_root *root, unsigned long index, unsigned int tag) { - struct radix_tree_node *node = NULL; - struct radix_tree_node *slot = NULL; - unsigned int height, shift; + struct radix_tree_node *node, *parent; + unsigned long maxindex; + unsigned int shift; int uninitialized_var(offset); - height = root->height; - if (index > radix_tree_maxindex(height)) - goto out; - - shift = height * RADIX_TREE_MAP_SHIFT; - slot = root->rnode; + shift = radix_tree_load_root(root, &node, &maxindex); + if (index > maxindex) + return NULL; - while (shift) 
{ - if (slot == NULL) - goto out; - if (!radix_tree_is_indirect_ptr(slot)) - break; - slot = indirect_to_ptr(slot); + parent = NULL; + while (radix_tree_is_indirect_ptr(node)) { shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - node = slot; - slot = slot->slots[offset]; + + parent = indirect_to_ptr(node); + offset = radix_tree_descend(parent, &node, offset); } - if (slot == NULL) + if (node == NULL) goto out; - while (node) { - if (!tag_get(node, tag, offset)) + index >>= shift; + + while (parent) { + if (!tag_get(parent, tag, offset)) goto out; - tag_clear(node, tag, offset); - if (any_tag_set(node, tag)) + tag_clear(parent, tag, offset); + if (any_tag_set(parent, tag)) goto out; index >>= RADIX_TREE_MAP_SHIFT; offset = index & RADIX_TREE_MAP_MASK; - node = node->parent; + parent = parent->parent; } /* clear the root's tag bit */ @@ -813,7 +809,7 @@ void *radix_tree_tag_clear(struct radix_tree_root *root, root_tag_clear(root, tag); out: - return slot; + return node; } EXPORT_SYMBOL(radix_tree_tag_clear); -- cgit v0.10.2 From 4589ba6d0f439e4673830fdc65099179832ddde5 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:38 -0700 Subject: radix-tree: rewrite radix_tree_tag_get Use the new multi-order support functions to rewrite radix_tree_tag_get() Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index ee56562..b1ca744 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -831,45 +831,37 @@ EXPORT_SYMBOL(radix_tree_tag_clear); int radix_tree_tag_get(struct radix_tree_root *root, unsigned long index, unsigned int tag) { - unsigned int height, shift; - struct radix_tree_node *node; + struct radix_tree_node *node, *parent; + unsigned long maxindex; + unsigned int shift; - /* check the root's tag bit */ if (!root_tag_get(root, tag)) return 0; - node = rcu_dereference_raw(root->rnode); + shift = radix_tree_load_root(root, &node, &maxindex); + if (index > maxindex) + return 0; if (node == NULL) return 0; - if (!radix_tree_is_indirect_ptr(node)) - return (index == 0); - node = indirect_to_ptr(node); - - height = node->path & RADIX_TREE_HEIGHT_MASK; - if (index > radix_tree_maxindex(height)) - return 0; + while (radix_tree_is_indirect_ptr(node)) { + int offset; - shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + shift -= RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; - for ( ; ; ) { - int offset; + parent = indirect_to_ptr(node); + offset = radix_tree_descend(parent, &node, offset); - if (node == NULL) + if (!node) return 0; - node = indirect_to_ptr(node); - - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - if (!tag_get(node, tag, offset)) + if (!tag_get(parent, tag, offset)) return 0; - if (height == 1) - return 1; - node = rcu_dereference_raw(node->slots[offset]); - if (!radix_tree_is_indirect_ptr(node)) - return 1; - shift -= RADIX_TREE_MAP_SHIFT; - height--; + if (node == RADIX_TREE_RETRY) + break; } + + return 1; } EXPORT_SYMBOL(radix_tree_tag_get); -- cgit v0.10.2 From 0fc9b8ca2b1df4948e9516697b1cf12f030968bd Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:41 -0700 Subject: radix-tree test suite: add multi-order tag test Add a generic test for multi-order tag verification, and call it using several different configurations. 
This test creates a multi-order radix tree using the given index and order, and then sets, checks and clears tags using the indices covered by the single multi-order radix tree entry. With the various calls done by this test we verify root multi-order entries without siblings, multi-order entries without siblings in a radix tree node, as well as multi-order entries with siblings of various sizes. Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index ba27fe0..1b6fc9b 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -19,6 +19,102 @@ #include "test.h" +#define for_each_index(i, base, order) \ + for (i = base; i < base + (1 << order); i++) + +static void __multiorder_tag_test(int index, int order) +{ + RADIX_TREE(tree, GFP_KERNEL); + int base, err, i; + + /* our canonical entry */ + base = index & ~((1 << order) - 1); + + printf("Multiorder tag test with index %d, canonical entry %d\n", + index, base); + + err = item_insert_order(&tree, index, order); + assert(!err); + + /* + * Verify we get collisions for covered indices. We try and fail to + * insert an exceptional entry so we don't leak memory via + * item_insert_order(). + */ + for_each_index(i, base, order) { + err = __radix_tree_insert(&tree, i, order, + (void *)(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY)); + assert(err == -EEXIST); + } + + for_each_index(i, base, order) { + assert(!radix_tree_tag_get(&tree, i, 0)); + assert(!radix_tree_tag_get(&tree, i, 1)); + } + + assert(radix_tree_tag_set(&tree, index, 0)); + + for_each_index(i, base, order) { + assert(radix_tree_tag_get(&tree, i, 0)); + assert(!radix_tree_tag_get(&tree, i, 1)); + } + + assert(radix_tree_tag_clear(&tree, index, 0)); + + for_each_index(i, base, order) { + assert(!radix_tree_tag_get(&tree, i, 0)); + assert(!radix_tree_tag_get(&tree, i, 1)); + } + + assert(!radix_tree_tagged(&tree, 0)); + assert(!radix_tree_tagged(&tree, 1)); + + item_kill_tree(&tree); +} + +static void multiorder_tag_tests(void) +{ + /* test multi-order entry for indices 0-7 with no sibling pointers */ + __multiorder_tag_test(0, 3); + __multiorder_tag_test(5, 3); + + /* test multi-order entry for indices 8-15 with no sibling pointers */ + __multiorder_tag_test(8, 3); + __multiorder_tag_test(15, 3); + + /* + * Our order 5 entry covers indices 0-31 in a tree with height=2. + * This is broken up as follows: + * 0-7: canonical entry + * 8-15: sibling 1 + * 16-23: sibling 2 + * 24-31: sibling 3 + */ + __multiorder_tag_test(0, 5); + __multiorder_tag_test(29, 5); + + /* same test, but with indices 32-63 */ + __multiorder_tag_test(32, 5); + __multiorder_tag_test(44, 5); + + /* + * Our order 8 entry covers indices 0-255 in a tree with height=3. 
+ * This is broken up as follows: + * 0-63: canonical entry + * 64-127: sibling 1 + * 128-191: sibling 2 + * 192-255: sibling 3 + */ + __multiorder_tag_test(0, 8); + __multiorder_tag_test(190, 8); + + /* same test, but with indices 256-511 */ + __multiorder_tag_test(256, 8); + __multiorder_tag_test(300, 8); + + __multiorder_tag_test(0x12345678UL, 8); +} + static void multiorder_check(unsigned long index, int order) { unsigned long i; @@ -196,6 +292,7 @@ void multiorder_checks(void) multiorder_shrink((1UL << (i + RADIX_TREE_MAP_SHIFT)), i); multiorder_insert_bug(); + multiorder_tag_tests(); multiorder_iteration(); multiorder_tagged_iteration(); } -- cgit v0.10.2 From 8a14f4d8328cc8615f8a5487c4173f36a8314796 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:44 -0700 Subject: radix-tree: fix radix_tree_create for sibling entries If the radix tree user attempted to insert a colliding entry with an existing multiorder entry, then radix_tree_create() could encounter a sibling entry when walking down the tree to look for a slot. Use radix_tree_descend() to fix the problem, and add a test-case to make sure the problem doesn't come back in future. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index b1ca744..9b5d8a9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -548,9 +548,9 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, /* Go a level down */ height--; shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; node = indirect_to_ptr(slot); - slot = node->slots[offset]; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = radix_tree_descend(node, &slot, offset); } #ifdef CONFIG_RADIX_TREE_MULTIORDER diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index 1b6fc9b..fc93457 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -135,6 +135,11 @@ static void multiorder_check(unsigned long index, int order) item_check_absent(&tree, i); for (i = max; i < 2*max; i++) item_check_absent(&tree, i); + for (i = min; i < max; i++) { + static void *entry = (void *) + (0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY); + assert(radix_tree_insert(&tree, i, entry) == -EEXIST); + } assert(item_delete(&tree, index) != 0); -- cgit v0.10.2 From 0a2efc6c809b01872321d9c7e7d82d59ac6fde10 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:46 -0700 Subject: radix-tree: rewrite radix_tree_locate_item Use the new multi-order support functions to rewrite radix_tree_locate_item(). Modify the locate tests to test multiorder entries too. 
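For readers unfamiliar with this interface, a short usage sketch in the style of the test suite (it assumes the item_* helpers from this series and is an illustration, not part of the patch): radix_tree_locate_item() walks the tree looking for a given item pointer and returns the index it is stored at, or -1 if it is not present; for a multi-order entry the expected result after this series is the canonical insertion index.

static void locate_item_sketch(void)
{
	RADIX_TREE(tree, GFP_KERNEL);
	struct item *item;

	assert(item_insert_order(&tree, 128, 4) == 0);	/* covers indices 128-143 */

	item = item_lookup(&tree, 128);
	assert(item != NULL);
	assert(radix_tree_locate_item(&tree, item) == 128);

	/* A pointer that was never inserted is reported as not found. */
	assert(radix_tree_locate_item(&tree, &tree) == -1);

	item_kill_tree(&tree);
}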
[hughd@google.com: radix_tree_locate_item() is often returning the wrong index] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1605012108490.1166@eggly.anvils Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 9b5d8a9..8329a2e 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1303,58 +1303,54 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot); #if defined(CONFIG_SHMEM) && defined(CONFIG_SWAP) #include /* for cond_resched() */ +struct locate_info { + unsigned long found_index; + bool stop; +}; + /* * This linear search is at present only useful to shmem_unuse_inode(). */ static unsigned long __locate(struct radix_tree_node *slot, void *item, - unsigned long index, unsigned long *found_index) + unsigned long index, struct locate_info *info) { unsigned int shift, height; unsigned long i; height = slot->path & RADIX_TREE_HEIGHT_MASK; - shift = (height-1) * RADIX_TREE_MAP_SHIFT; + shift = height * RADIX_TREE_MAP_SHIFT; - for ( ; height > 1; height--) { - i = (index >> shift) & RADIX_TREE_MAP_MASK; - for (;;) { - if (slot->slots[i] != NULL) - break; - index &= ~((1UL << shift) - 1); - index += 1UL << shift; - if (index == 0) - goto out; /* 32-bit wraparound */ - i++; - if (i == RADIX_TREE_MAP_SIZE) - goto out; - } + do { + shift -= RADIX_TREE_MAP_SHIFT; - slot = rcu_dereference_raw(slot->slots[i]); - if (slot == NULL) - goto out; - if (!radix_tree_is_indirect_ptr(slot)) { - if (slot == item) { - *found_index = index + i; - index = 0; - } else { - index += shift; + for (i = (index >> shift) & RADIX_TREE_MAP_MASK; + i < RADIX_TREE_MAP_SIZE; + i++, index += (1UL << shift)) { + struct radix_tree_node *node = + rcu_dereference_raw(slot->slots[i]); + if (node == RADIX_TREE_RETRY) + goto out; + if (!radix_tree_is_indirect_ptr(node)) { + if (node == item) { + info->found_index = index; + info->stop = true; + goto out; + } + continue; } - goto out; + node = indirect_to_ptr(node); + if (is_sibling_entry(slot, node)) + continue; + slot = node; + break; } - slot = indirect_to_ptr(slot); - shift -= RADIX_TREE_MAP_SHIFT; - } + if (i == RADIX_TREE_MAP_SIZE) + break; + } while (shift); - /* Bottom level: check items */ - for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { - if (slot->slots[i] == item) { - *found_index = index + i; - index = 0; - goto out; - } - } - index += RADIX_TREE_MAP_SIZE; out: + if ((index == 0) && (i == RADIX_TREE_MAP_SIZE)) + info->stop = true; return index; } @@ -1372,7 +1368,10 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) struct radix_tree_node *node; unsigned long max_index; unsigned long cur_index = 0; - unsigned long found_index = -1; + struct locate_info info = { + .found_index = -1, + .stop = false, + }; do { rcu_read_lock(); @@ -1380,24 +1379,24 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) if (!radix_tree_is_indirect_ptr(node)) { rcu_read_unlock(); if (node == item) - found_index = 0; + info.found_index = 0; break; } node = indirect_to_ptr(node); - max_index = radix_tree_maxindex(node->path & - RADIX_TREE_HEIGHT_MASK); + + max_index = node_maxindex(node); if (cur_index > max_index) { rcu_read_unlock(); break; } - cur_index = __locate(node, item, cur_index, &found_index); + cur_index = __locate(node, item, cur_index, &info); rcu_read_unlock(); cond_resched(); - } while (cur_index != 0 && cur_index <= max_index); + } 
while (!info.stop && cur_index <= max_index); - return found_index; + return info.found_index; } #else unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index b6a700b..65231e9 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -232,17 +232,18 @@ void copy_tag_check(void) item_kill_tree(&tree); } -void __locate_check(struct radix_tree_root *tree, unsigned long index) +void __locate_check(struct radix_tree_root *tree, unsigned long index, + unsigned order) { struct item *item; unsigned long index2; - item_insert(tree, index); + item_insert_order(tree, index, order); item = item_lookup(tree, index); index2 = radix_tree_locate_item(tree, item); if (index != index2) { - printf("index %ld inserted; found %ld\n", - index, index2); + printf("index %ld order %d inserted; found %ld\n", + index, order, index2); abort(); } } @@ -250,21 +251,26 @@ void __locate_check(struct radix_tree_root *tree, unsigned long index) static void locate_check(void) { RADIX_TREE(tree, GFP_KERNEL); + unsigned order; unsigned long offset, index; - for (offset = 0; offset < (1 << 3); offset++) { - for (index = 0; index < (1UL << 5); index++) { - __locate_check(&tree, index + offset); - } - if (radix_tree_locate_item(&tree, &tree) != -1) - abort(); + for (order = 0; order < 20; order++) { + for (offset = 0; offset < (1 << (order + 3)); + offset += (1UL << order)) { + for (index = 0; index < (1UL << (order + 5)); + index += (1UL << order)) { + __locate_check(&tree, index + offset, order); + } + if (radix_tree_locate_item(&tree, &tree) != -1) + abort(); - item_kill_tree(&tree); + item_kill_tree(&tree); + } } if (radix_tree_locate_item(&tree, &tree) != -1) abort(); - __locate_check(&tree, -1); + __locate_check(&tree, -1, 0); if (radix_tree_locate_item(&tree, &tree) != -1) abort(); item_kill_tree(&tree); -- cgit v0.10.2 From eb73f7f3300c144c4b886dd56ea4c3d2b2d58249 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:49 -0700 Subject: radix-tree: add test for radix_tree_locate_item() Add a unit test that provides coverage for the bug fixed in the commit entitled "radix-tree: rewrite radix_tree_locate_item fix" from Hugh Dickins. I've verified that this test fails before his patch due to miscalculated 'index' values in __locate() in lib/radix-tree.c, and passes with his fix. Link: http://lkml.kernel.org/r/1462307263-20623-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/linux/init.h b/tools/testing/radix-tree/linux/init.h new file mode 100644 index 0000000..360cabb --- /dev/null +++ b/tools/testing/radix-tree/linux/init.h @@ -0,0 +1 @@ +/* An empty file stub that allows radix-tree.c to compile. 
*/ diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c index 65231e9..b7619ff 100644 --- a/tools/testing/radix-tree/main.c +++ b/tools/testing/radix-tree/main.c @@ -232,7 +232,7 @@ void copy_tag_check(void) item_kill_tree(&tree); } -void __locate_check(struct radix_tree_root *tree, unsigned long index, +static void __locate_check(struct radix_tree_root *tree, unsigned long index, unsigned order) { struct item *item; @@ -248,12 +248,25 @@ void __locate_check(struct radix_tree_root *tree, unsigned long index, } } +static void __order_0_locate_check(void) +{ + RADIX_TREE(tree, GFP_KERNEL); + int i; + + for (i = 0; i < 50; i++) + __locate_check(&tree, rand() % INT_MAX, 0); + + item_kill_tree(&tree); +} + static void locate_check(void) { RADIX_TREE(tree, GFP_KERNEL); unsigned order; unsigned long offset, index; + __order_0_locate_check(); + for (order = 0; order < 20; order++) { for (offset = 0; offset < (1 << (order + 3)); offset += (1UL << order)) { -- cgit v0.10.2 From 070c5ac2740b5db89d381a09fb03b2480b2f7a74 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:52 -0700 Subject: radix-tree: fix radix_tree_range_tag_if_tagged() for multiorder entries I had previously decided that tagging a single multiorder entry would count as tagging 2^order entries for the purposes of 'nr_to_tag'. I now believe that decision to be a mistake, and it should count as a single entry. That's more likely to be what callers expect. When walking back up the tree from a newly-tagged entry, the current code assumed we were starting from the lowest level of the tree; if we have a multiorder entry with an order at least RADIX_TREE_MAP_SHIFT in size then we need to shift the index by 'shift' before we start walking back up the tree, or we will end up not setting tags on higher entries, and then mistakenly thinking that entries below a certain point in the tree are not tagged. If the first index we examine is a sibling entry of a tagged multiorder entry, we were not tagging it. We need to examine the canonical entry, and the easiest way to do that is to use radix_tree_descend(). We then have to skip over sibling slots when looking for the next entry in the tree or we will end up walking back to the canonical entry. Add several tests for radix_tree_range_tag_if_tagged(). 
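A compact sketch of the intended nr_to_tag accounting, in the style of the test suite (it assumes the same helpers as the other multiorder tests in this series; the tests actually added below are more thorough): a single tagged multi-order entry is reported once, while the copied tag remains visible at every index the entry covers.

static void multiorder_range_tag_sketch(void)
{
	RADIX_TREE(tree, GFP_KERNEL);
	unsigned long first = 0;

	assert(item_insert_order(&tree, 0, 5) == 0);	/* one entry covering 0-31 */
	assert(radix_tree_tag_set(&tree, 0, 0));

	/*
	 * Copy tag 0 to tag 1 across the whole range: the single order-5
	 * entry should count as one tagged leaf, not as 32.
	 */
	assert(radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, 10, 0, 1) == 1);

	/* The copied tag is visible at any index the entry covers. */
	assert(radix_tree_tag_get(&tree, 7, 1));

	item_kill_tree(&tree);
}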
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 8329a2e..8df0df2 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1033,14 +1033,13 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, unsigned long nr_to_tag, unsigned int iftag, unsigned int settag) { - unsigned int height = root->height; - struct radix_tree_node *node = NULL; - struct radix_tree_node *slot; - unsigned int shift; + struct radix_tree_node *slot, *node = NULL; + unsigned long maxindex; + unsigned int shift = radix_tree_load_root(root, &slot, &maxindex); unsigned long tagged = 0; unsigned long index = *first_indexp; - last_index = min(last_index, radix_tree_maxindex(height)); + last_index = min(last_index, maxindex); if (index > last_index) return 0; if (!nr_to_tag) @@ -1049,80 +1048,71 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, *first_indexp = last_index + 1; return 0; } - if (height == 0) { + if (!radix_tree_is_indirect_ptr(slot)) { *first_indexp = last_index + 1; root_tag_set(root, settag); return 1; } - shift = (height - 1) * RADIX_TREE_MAP_SHIFT; - slot = indirect_to_ptr(root->rnode); + node = indirect_to_ptr(slot); + shift -= RADIX_TREE_MAP_SHIFT; for (;;) { unsigned long upindex; - int offset; + unsigned offset; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - if (!slot->slots[offset]) + offset = radix_tree_descend(node, &slot, offset); + if (!slot) goto next; - if (!tag_get(slot, iftag, offset)) + if (!tag_get(node, iftag, offset)) goto next; - if (shift) { - node = slot; - slot = slot->slots[offset]; - if (radix_tree_is_indirect_ptr(slot)) { - slot = indirect_to_ptr(slot); - shift -= RADIX_TREE_MAP_SHIFT; - continue; - } else { - slot = node; - node = node->parent; - } + /* Sibling slots never have tags set on them */ + if (radix_tree_is_indirect_ptr(slot)) { + node = indirect_to_ptr(slot); + shift -= RADIX_TREE_MAP_SHIFT; + continue; } /* tag the leaf */ - tagged += 1 << shift; - tag_set(slot, settag, offset); + tagged++; + tag_set(node, settag, offset); + slot = node->parent; /* walk back up the path tagging interior nodes */ - upindex = index; - while (node) { + upindex = index >> shift; + while (slot) { upindex >>= RADIX_TREE_MAP_SHIFT; offset = upindex & RADIX_TREE_MAP_MASK; /* stop if we find a node with the tag already set */ - if (tag_get(node, settag, offset)) + if (tag_get(slot, settag, offset)) break; - tag_set(node, settag, offset); - node = node->parent; + tag_set(slot, settag, offset); + slot = slot->parent; } - /* - * Small optimization: now clear that node pointer. - * Since all of this slot's ancestors now have the tag set - * from setting it above, we have no further need to walk - * back up the tree setting tags, until we update slot to - * point to another radix_tree_node. - */ - node = NULL; - -next: + next: /* Go to next item at level determined by 'shift' */ index = ((index >> shift) + 1) << shift; /* Overflow can happen when last_index is ~0UL... */ if (index > last_index || !index) break; - if (tagged >= nr_to_tag) - break; - while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) { + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + while (offset == 0) { /* * We've fully scanned this node. Go up. Because * last_index is guaranteed to be in the tree, what * we do below cannot wander astray. 
*/ - slot = slot->parent; + node = node->parent; shift += RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; } + if (is_sibling_entry(node, node->slots[offset])) + goto next; + if (tagged >= nr_to_tag) + break; } /* * We need not to tag the root tag if there is no tag which is set with diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index fc93457..c061f4b 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -26,6 +26,7 @@ static void __multiorder_tag_test(int index, int order) { RADIX_TREE(tree, GFP_KERNEL); int base, err, i; + unsigned long first = 0; /* our canonical entry */ base = index & ~((1 << order) - 1); @@ -59,13 +60,16 @@ static void __multiorder_tag_test(int index, int order) assert(!radix_tree_tag_get(&tree, i, 1)); } + assert(radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, 10, 0, 1) == 1); assert(radix_tree_tag_clear(&tree, index, 0)); for_each_index(i, base, order) { assert(!radix_tree_tag_get(&tree, i, 0)); - assert(!radix_tree_tag_get(&tree, i, 1)); + assert(radix_tree_tag_get(&tree, i, 1)); } + assert(radix_tree_tag_clear(&tree, index, 1)); + assert(!radix_tree_tagged(&tree, 0)); assert(!radix_tree_tagged(&tree, 1)); @@ -244,6 +248,7 @@ void multiorder_tagged_iteration(void) RADIX_TREE(tree, GFP_KERNEL); struct radix_tree_iter iter; void **slot; + unsigned long first = 0; int i; printf("Multiorder tagged iteration test\n"); @@ -280,6 +285,24 @@ void multiorder_tagged_iteration(void) i++; } + radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, + MT_NUM_ENTRIES, 1, 2); + + i = 0; + radix_tree_for_each_tagged(slot, &tree, &iter, 1, 2) { + assert(iter.index == tag_index[i]); + i++; + } + + first = 1; + radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, + MT_NUM_ENTRIES, 1, 0); + i = 0; + radix_tree_for_each_tagged(slot, &tree, &iter, 0, 0) { + assert(iter.index == tag_index[i]); + i++; + } + item_kill_tree(&tree); } diff --git a/tools/testing/radix-tree/tag_check.c b/tools/testing/radix-tree/tag_check.c index 83136be..b7447ce 100644 --- a/tools/testing/radix-tree/tag_check.c +++ b/tools/testing/radix-tree/tag_check.c @@ -12,6 +12,7 @@ static void __simple_checks(struct radix_tree_root *tree, unsigned long index, int tag) { + unsigned long first = 0; int ret; item_check_absent(tree, index); @@ -22,6 +23,10 @@ __simple_checks(struct radix_tree_root *tree, unsigned long index, int tag) item_tag_set(tree, index, tag); ret = item_tag_get(tree, index, tag); assert(ret != 0); + ret = radix_tree_range_tag_if_tagged(tree, &first, ~0UL, 10, tag, !tag); + assert(ret == 1); + ret = item_tag_get(tree, index, !tag); + assert(ret != 0); ret = item_delete(tree, index); assert(ret != 0); item_insert(tree, index); @@ -304,6 +309,7 @@ static void single_check(void) struct item *items[BATCH]; RADIX_TREE(tree, GFP_KERNEL); int ret; + unsigned long first = 0; item_insert(&tree, 0); item_tag_set(&tree, 0, 0); @@ -313,6 +319,10 @@ static void single_check(void) assert(ret == 0); verify_tag_consistency(&tree, 0); verify_tag_consistency(&tree, 1); + ret = radix_tree_range_tag_if_tagged(&tree, &first, 10, 10, 0, 1); + assert(ret == 1); + ret = radix_tree_gang_lookup_tag(&tree, (void **)items, 0, BATCH, 1); + assert(ret == 1); item_kill_tree(&tree); } -- cgit v0.10.2 From 0796c58325533f87c00949a545eb607baa8441cb Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Fri, 20 May 2016 17:02:55 -0700 Subject: radix-tree: fix radix_tree_dump() for multi-order entries - Print which indices are 
covered by every leaf entry - Print sibling entries - Print the node pointer instead of the slot entry - Build by default in userspace, and make it accessible to the test-suite Signed-off-by: Ross Zwisler Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 8df0df2..a1a44f9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -215,27 +215,36 @@ radix_tree_find_next_bit(const unsigned long *addr, return size; } -#if 0 -static void dump_node(void *slot, int height, int offset) +#ifndef __KERNEL__ +static void dump_node(struct radix_tree_node *node, unsigned offset, + unsigned shift, unsigned long index) { - struct radix_tree_node *node; - int i; - - if (!slot) - return; - - if (height == 0) { - pr_debug("radix entry %p offset %d\n", slot, offset); - return; - } + unsigned long i; - node = indirect_to_ptr(slot); pr_debug("radix node: %p offset %d tags %lx %lx %lx path %x count %d parent %p\n", - slot, offset, node->tags[0][0], node->tags[1][0], - node->tags[2][0], node->path, node->count, node->parent); - - for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) - dump_node(node->slots[i], height - 1, i); + node, offset, + node->tags[0][0], node->tags[1][0], node->tags[2][0], + node->path, node->count, node->parent); + + for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { + unsigned long first = index | (i << shift); + unsigned long last = first | ((1UL << shift) - 1); + void *entry = node->slots[i]; + if (!entry) + continue; + if (is_sibling_entry(node, entry)) { + pr_debug("radix sblng %p offset %ld val %p indices %ld-%ld\n", + entry, i, + *(void **)indirect_to_ptr(entry), + first, last); + } else if (!radix_tree_is_indirect_ptr(entry)) { + pr_debug("radix entry %p offset %ld indices %ld-%ld\n", + entry, i, first, last); + } else { + dump_node(indirect_to_ptr(entry), i, + shift - RADIX_TREE_MAP_SHIFT, first); + } + } } /* For debug */ @@ -246,7 +255,8 @@ static void radix_tree_dump(struct radix_tree_root *root) root->gfp_mask >> __GFP_BITS_SHIFT); if (!radix_tree_is_indirect_ptr(root->rnode)) return; - dump_node(root->rnode, root->height, 0); + dump_node(indirect_to_ptr(root->rnode), 0, + (root->height - 1) * RADIX_TREE_MAP_SHIFT, 0); } #endif diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h index 53cb595..67217c9 100644 --- a/tools/testing/radix-tree/test.h +++ b/tools/testing/radix-tree/test.h @@ -40,5 +40,6 @@ extern int nr_allocated; /* Normally private parts of lib/radix-tree.c */ void *indirect_to_ptr(void *ptr); +void radix_tree_dump(struct radix_tree_root *root); int root_tag_get(struct radix_tree_root *root, unsigned int tag); unsigned long radix_tree_maxindex(unsigned int height); -- cgit v0.10.2 From 6b053b8e5f5e12089fbbc127167199746bb5b4c1 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:02:58 -0700 Subject: radix-tree: add copyright statements The multiorder support is a sufficiently large feature to be worth adding copyright lines for.
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index a1a44f9..1e75813 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -4,6 +4,8 @@ * Copyright (C) 2005 SGI, Christoph Lameter * Copyright (C) 2006 Nick Piggin * Copyright (C) 2012 Konstantin Khlebnikov + * Copyright (C) 2016 Intel, Matthew Wilcox + * Copyright (C) 2016 Intel, Ross Zwisler * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as -- cgit v0.10.2 From b76ba4af4ddd6a06f7f65769e7be1bc56556cdf5 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:01 -0700 Subject: drivers/hwspinlock: use correct radix tree API radix_tree_is_indirect_ptr() is an internal API. The correct call to use is radix_tree_deref_retry() which has the appropriate unlikely() annotation. Fixes: c6400ba7e13a ("drivers/hwspinlock: fix race between radix tree insertion and lookup") Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/drivers/hwspinlock/hwspinlock_core.c b/drivers/hwspinlock/hwspinlock_core.c index d50c701..4074441 100644 --- a/drivers/hwspinlock/hwspinlock_core.c +++ b/drivers/hwspinlock/hwspinlock_core.c @@ -313,7 +313,7 @@ int of_hwspin_lock_get_id(struct device_node *np, int index) hwlock = radix_tree_deref_slot(slot); if (unlikely(!hwlock)) continue; - if (radix_tree_is_indirect_ptr(hwlock)) { + if (radix_tree_deref_retry(hwlock)) { slot = radix_tree_iter_retry(&iter); continue; } -- cgit v0.10.2 From 2fcd9005cc03ab09ea2a940515ed728d43df66c4 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:04 -0700 Subject: radix-tree: miscellaneous fixes Typos, whitespace, grammar, line length, using the correct types, etc. 
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 1e75813..75944e4 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -66,7 +66,7 @@ static struct kmem_cache *radix_tree_node_cachep; * Per-cpu pool of preloaded nodes */ struct radix_tree_preload { - int nr; + unsigned nr; /* nodes->private_data points to next preallocated node */ struct radix_tree_node *nodes; }; @@ -147,7 +147,7 @@ static inline void root_tag_set(struct radix_tree_root *root, unsigned int tag) root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT)); } -static inline void root_tag_clear(struct radix_tree_root *root, unsigned int tag) +static inline void root_tag_clear(struct radix_tree_root *root, unsigned tag) { root->gfp_mask &= (__force gfp_t)~(1 << (tag + __GFP_BITS_SHIFT)); } @@ -159,7 +159,7 @@ static inline void root_tag_clear_all(struct radix_tree_root *root) static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag) { - return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT)); + return (__force int)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT)); } static inline unsigned root_tags_get(struct radix_tree_root *root) @@ -173,7 +173,7 @@ static inline unsigned root_tags_get(struct radix_tree_root *root) */ static inline int any_tag_set(struct radix_tree_node *node, unsigned int tag) { - int idx; + unsigned idx; for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) { if (node->tags[tag][idx]) return 1; @@ -273,9 +273,9 @@ radix_tree_node_alloc(struct radix_tree_root *root) gfp_t gfp_mask = root_gfp_mask(root); /* - * Preload code isn't irq safe and it doesn't make sence to use - * preloading in the interrupt anyway as all the allocations have to - * be atomic. So just do normal allocation when in interrupt. + * Preload code isn't irq safe and it doesn't make sense to use + * preloading during an interrupt anyway as all the allocations have + * to be atomic. So just do normal allocation when in interrupt. */ if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) { struct radix_tree_preload *rtp; @@ -448,7 +448,6 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, static int radix_tree_extend(struct radix_tree_root *root, unsigned long index) { - struct radix_tree_node *node; struct radix_tree_node *slot; unsigned int height; int tag; @@ -465,7 +464,9 @@ static int radix_tree_extend(struct radix_tree_root *root, do { unsigned int newheight; - if (!(node = radix_tree_node_alloc(root))) + struct radix_tree_node *node = radix_tree_node_alloc(root); + + if (!node) return -ENOMEM; /* Propagate the aggregated tag info into the new root */ @@ -542,7 +543,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, while (shift > order) { if (slot == NULL) { /* Have to add a child node. */ - if (!(slot = radix_tree_node_alloc(root))) + slot = radix_tree_node_alloc(root); + if (!slot) return -ENOMEM; slot->path = height; slot->parent = node; @@ -722,13 +724,13 @@ EXPORT_SYMBOL(radix_tree_lookup); * radix_tree_tag_set - set a tag on a radix tree node * @root: radix tree root * @index: index key - * @tag: tag index + * @tag: tag index * * Set the search tag (which must be < RADIX_TREE_MAX_TAGS) * corresponding to @index in the radix tree. From * the root all the way down to the leaf node. * - * Returns the address of the tagged item. 
Setting a tag on a not-present + * Returns the address of the tagged item. Setting a tag on a not-present * item is a bug. */ void *radix_tree_tag_set(struct radix_tree_root *root, @@ -767,11 +769,11 @@ EXPORT_SYMBOL(radix_tree_tag_set); * radix_tree_tag_clear - clear a tag on a radix tree node * @root: radix tree root * @index: index key - * @tag: tag index + * @tag: tag index * * Clear the search tag (which must be < RADIX_TREE_MAX_TAGS) - * corresponding to @index in the radix tree. If - * this causes the leaf node to have no tags set then clear the tag in the + * corresponding to @index in the radix tree. If this causes + * the leaf node to have no tags set then clear the tag in the * next-to-leaf node, etc. * * Returns the address of the tagged item on success, else NULL. ie: @@ -829,7 +831,7 @@ EXPORT_SYMBOL(radix_tree_tag_clear); * radix_tree_tag_get - get a tag on a radix tree node * @root: radix tree root * @index: index key - * @tag: tag index (< RADIX_TREE_MAX_TAGS) + * @tag: tag index (< RADIX_TREE_MAX_TAGS) * * Return values: * @@ -1035,7 +1037,7 @@ EXPORT_SYMBOL(radix_tree_next_chunk); * set is outside the range we are scanning. This reults in dangling tags and * can lead to problems with later tag operations (e.g. livelocks on lookups). * - * The function returns number of leaves where the tag was set and sets + * The function returns the number of leaves where the tag was set and sets * *first_indexp to the first unscanned index. * WARNING! *first_indexp can wrap if last_index is ULONG_MAX. Caller must * be prepared to handle that. @@ -1153,9 +1155,10 @@ EXPORT_SYMBOL(radix_tree_range_tag_if_tagged); * * Like radix_tree_lookup, radix_tree_gang_lookup may be called under * rcu_read_lock. In this case, rather than the returned results being - * an atomic snapshot of the tree at a single point in time, the semantics - * of an RCU protected gang lookup are as though multiple radix_tree_lookups - * have been issued in individual locks, and results stored in 'results'. + * an atomic snapshot of the tree at a single point in time, the + * semantics of an RCU protected gang lookup are as though multiple + * radix_tree_lookups have been issued in individual locks, and results + * stored in 'results'. */ unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, void **results, @@ -1460,7 +1463,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root) * their slot to become empty sooner or later. * * For example, lockless pagecache will look up a slot, deref - * the page pointer, and if the page is 0 refcount it means it + * the page pointer, and if the page has 0 refcount it means it * was concurrently deleted from pagecache so try the deref * again. 
Fortunately there is already a requirement for logic * to retry the entire slot lookup -- the indirect pointer @@ -1649,24 +1652,23 @@ static __init void radix_tree_init_maxindex(void) } static int radix_tree_callback(struct notifier_block *nfb, - unsigned long action, - void *hcpu) + unsigned long action, void *hcpu) { - int cpu = (long)hcpu; - struct radix_tree_preload *rtp; - struct radix_tree_node *node; - - /* Free per-cpu pool of perloaded nodes */ - if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { - rtp = &per_cpu(radix_tree_preloads, cpu); - while (rtp->nr) { + int cpu = (long)hcpu; + struct radix_tree_preload *rtp; + struct radix_tree_node *node; + + /* Free per-cpu pool of preloaded nodes */ + if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { + rtp = &per_cpu(radix_tree_preloads, cpu); + while (rtp->nr) { node = rtp->nodes; rtp->nodes = node->private_data; kmem_cache_free(radix_tree_node_cachep, node); rtp->nr--; - } - } - return NOTIFY_OK; + } + } + return NOTIFY_OK; } void __init radix_tree_init(void) -- cgit v0.10.2 From 0c7fa0a8418cbe0e8963fe36db9575d03b8589f7 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:07 -0700 Subject: radix-tree: split node->path into offset and height Neither piece of information we're storing in node->path can be larger than 64, so store each in its own unsigned char instead of shifting and masking to store them both in an unsigned int. Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 8558d52..2d2ad9d 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -84,16 +84,13 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \ RADIX_TREE_MAP_SHIFT)) -/* Height component in node->path */ -#define RADIX_TREE_HEIGHT_SHIFT (RADIX_TREE_MAX_PATH + 1) -#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1) - /* Internally used bits of node->count */ #define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1) #define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1) struct radix_tree_node { - unsigned int path; /* Offset in parent & height from the bottom */ + unsigned char height; /* From the bottom */ + unsigned char offset; /* Slot offset in parent */ unsigned int count; union { struct { diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 75944e4..dd04b51 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -218,15 +218,15 @@ radix_tree_find_next_bit(const unsigned long *addr, } #ifndef __KERNEL__ -static void dump_node(struct radix_tree_node *node, unsigned offset, +static void dump_node(struct radix_tree_node *node, unsigned shift, unsigned long index) { unsigned long i; - pr_debug("radix node: %p offset %d tags %lx %lx %lx path %x count %d parent %p\n", - node, offset, + pr_debug("radix node: %p offset %d tags %lx %lx %lx height %d count %d parent %p\n", + node, node->offset, node->tags[0][0], node->tags[1][0], node->tags[2][0], - node->path, node->count, node->parent); + node->height, node->count, node->parent); for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { unsigned long first = index | (i << shift); @@ -243,7 +243,7 @@ static void dump_node(struct radix_tree_node *node, unsigned offset, pr_debug("radix entry %p offset %ld indices %ld-%ld\n", entry, i, first, last); } else { - 
dump_node(indirect_to_ptr(entry), i, + dump_node(indirect_to_ptr(entry), shift - RADIX_TREE_MAP_SHIFT, first); } } @@ -257,7 +257,7 @@ static void radix_tree_dump(struct radix_tree_root *root) root->gfp_mask >> __GFP_BITS_SHIFT); if (!radix_tree_is_indirect_ptr(root->rnode)) return; - dump_node(indirect_to_ptr(root->rnode), 0, + dump_node(indirect_to_ptr(root->rnode), (root->height - 1) * RADIX_TREE_MAP_SHIFT, 0); } #endif @@ -421,7 +421,7 @@ static inline unsigned long radix_tree_maxindex(unsigned int height) static inline unsigned long node_maxindex(struct radix_tree_node *node) { - return radix_tree_maxindex(node->path & RADIX_TREE_HEIGHT_MASK); + return radix_tree_maxindex(node->height); } static unsigned radix_tree_load_root(struct radix_tree_root *root, @@ -434,8 +434,7 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, if (likely(radix_tree_is_indirect_ptr(node))) { node = indirect_to_ptr(node); *maxindex = node_maxindex(node); - return (node->path & RADIX_TREE_HEIGHT_MASK) * - RADIX_TREE_MAP_SHIFT; + return node->height * RADIX_TREE_MAP_SHIFT; } *maxindex = 0; @@ -476,9 +475,10 @@ static int radix_tree_extend(struct radix_tree_root *root, } /* Increase the height. */ - newheight = root->height+1; - BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK); - node->path = newheight; + newheight = root->height + 1; + BUG_ON(newheight > BITS_PER_LONG); + node->height = newheight; + node->offset = 0; node->count = 1; node->parent = NULL; slot = root->rnode; @@ -546,13 +546,13 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, slot = radix_tree_node_alloc(root); if (!slot) return -ENOMEM; - slot->path = height; + slot->height = height; + slot->offset = offset; slot->parent = node; if (node) { rcu_assign_pointer(node->slots[offset], ptr_to_indirect(slot)); node->count++; - slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT; } else rcu_assign_pointer(root->rnode, ptr_to_indirect(slot)); @@ -1319,11 +1319,10 @@ struct locate_info { static unsigned long __locate(struct radix_tree_node *slot, void *item, unsigned long index, struct locate_info *info) { - unsigned int shift, height; + unsigned int shift; unsigned long i; - height = slot->path & RADIX_TREE_HEIGHT_MASK; - shift = height * RADIX_TREE_MAP_SHIFT; + shift = slot->height * RADIX_TREE_MAP_SHIFT; do { shift -= RADIX_TREE_MAP_SHIFT; @@ -1508,10 +1507,7 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, parent = node->parent; if (parent) { - unsigned int offset; - - offset = node->path >> RADIX_TREE_HEIGHT_SHIFT; - parent->slots[offset] = NULL; + parent->slots[node->offset] = NULL; parent->count--; } else { root_tag_clear_all(root); -- cgit v0.10.2 From c12e51b07b3ac4c188fd91a82f96840fdb9cca6f Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:10 -0700 Subject: radix-tree: replace node->height with node->shift node->shift represents the shift necessary for looking in the slots array at this level. It is equal to the old (node->height - 1) * RADIX_TREE_MAP_SHIFT. 
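To make that translation concrete: with the usual RADIX_TREE_MAP_SHIFT of 6, a node one level above the leaves has shift 6, the leaves have shift 0, and descent simply subtracts the map shift at each step. Below is a minimal userspace sketch (illustrative names and a toy main(), not kernel code) of how an index is decomposed into per-level slot offsets once each node carries its own shift.

/*
 * Toy sketch, not the kernel implementation: each node stores the shift
 * needed to index its slots, so descent no longer tracks a height counter.
 */
#include <stdio.h>

#define MAP_SHIFT 6			/* stand-in for RADIX_TREE_MAP_SHIFT */
#define MAP_MASK  ((1UL << MAP_SHIFT) - 1)

int main(void)
{
	unsigned long index = 0x12345;
	unsigned int height = 3;			/* old representation */
	unsigned int shift = (height - 1) * MAP_SHIFT;	/* new: stored per node */

	for (;; shift -= MAP_SHIFT) {
		/* same arithmetic the create/lookup paths use:
		 * (index >> shift) & RADIX_TREE_MAP_MASK */
		printf("shift %2u -> slot %lu\n", shift, (index >> shift) & MAP_MASK);
		if (shift == 0)
			break;
	}
	return 0;
}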
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 2d2ad9d..0374582 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -89,7 +89,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) #define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1) struct radix_tree_node { - unsigned char height; /* From the bottom */ + unsigned char shift; /* Bits remaining in each slot */ unsigned char offset; /* Slot offset in parent */ unsigned int count; union { diff --git a/lib/radix-tree.c b/lib/radix-tree.c index dd04b51..648da908 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -223,10 +223,10 @@ static void dump_node(struct radix_tree_node *node, { unsigned long i; - pr_debug("radix node: %p offset %d tags %lx %lx %lx height %d count %d parent %p\n", + pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n", node, node->offset, node->tags[0][0], node->tags[1][0], node->tags[2][0], - node->height, node->count, node->parent); + node->shift, node->count, node->parent); for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { unsigned long first = index | (i << shift); @@ -419,9 +419,14 @@ static inline unsigned long radix_tree_maxindex(unsigned int height) return height_to_maxindex[height]; } +static inline unsigned long shift_maxindex(unsigned int shift) +{ + return (RADIX_TREE_MAP_SIZE << shift) - 1; +} + static inline unsigned long node_maxindex(struct radix_tree_node *node) { - return radix_tree_maxindex(node->height); + return shift_maxindex(node->shift); } static unsigned radix_tree_load_root(struct radix_tree_root *root, @@ -434,7 +439,7 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, if (likely(radix_tree_is_indirect_ptr(node))) { node = indirect_to_ptr(node); *maxindex = node_maxindex(node); - return node->height * RADIX_TREE_MAP_SHIFT; + return node->shift + RADIX_TREE_MAP_SHIFT; } *maxindex = 0; @@ -475,9 +480,9 @@ static int radix_tree_extend(struct radix_tree_root *root, } /* Increase the height. */ - newheight = root->height + 1; + newheight = root->height; BUG_ON(newheight > BITS_PER_LONG); - node->height = newheight; + node->shift = newheight * RADIX_TREE_MAP_SHIFT; node->offset = 0; node->count = 1; node->parent = NULL; @@ -490,7 +495,7 @@ static int radix_tree_extend(struct radix_tree_root *root, node->slots[0] = slot; node = ptr_to_indirect(node); rcu_assign_pointer(root->rnode, node); - root->height = newheight; + root->height = ++newheight; } while (height > root->height); out: return height * RADIX_TREE_MAP_SHIFT; @@ -519,7 +524,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, { struct radix_tree_node *node = NULL, *slot; unsigned long maxindex; - unsigned int height, shift, offset; + unsigned int shift, offset; unsigned long max = index | ((1UL << order) - 1); shift = radix_tree_load_root(root, &slot, &maxindex); @@ -537,16 +542,15 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, } } - height = root->height; - offset = 0; /* uninitialised var warning */ while (shift > order) { + shift -= RADIX_TREE_MAP_SHIFT; if (slot == NULL) { /* Have to add a child node. 
*/ slot = radix_tree_node_alloc(root); if (!slot) return -ENOMEM; - slot->height = height; + slot->shift = shift; slot->offset = offset; slot->parent = node; if (node) { @@ -560,8 +564,6 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, break; /* Go a level down */ - height--; - shift -= RADIX_TREE_MAP_SHIFT; node = indirect_to_ptr(slot); offset = (index >> shift) & RADIX_TREE_MAP_MASK; offset = radix_tree_descend(node, &slot, offset); @@ -1322,7 +1324,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item, unsigned int shift; unsigned long i; - shift = slot->height * RADIX_TREE_MAP_SHIFT; + shift = slot->shift + RADIX_TREE_MAP_SHIFT; do { shift -= RADIX_TREE_MAP_SHIFT; -- cgit v0.10.2 From fb209019c92a9141fd73f3c4928edc1b299b3490 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:13 -0700 Subject: radix-tree: remove a use of root->height from delete_node If radix_tree_shrink returns whether it managed to shrink, then __radix_tree_delete_node doesn't ned to query the tree to find out whether it did any work or not. Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 648da908..75c9e61 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1415,8 +1415,10 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) * radix_tree_shrink - shrink height of a radix tree to minimal * @root radix tree root */ -static inline void radix_tree_shrink(struct radix_tree_root *root) +static inline bool radix_tree_shrink(struct radix_tree_root *root) { + bool shrunk = false; + /* try to shrink tree height */ while (root->height > 0) { struct radix_tree_node *to_free = root->rnode; @@ -1476,7 +1478,10 @@ static inline void radix_tree_shrink(struct radix_tree_root *root) to_free->slots[0] = RADIX_TREE_RETRY; radix_tree_node_free(to_free); + shrunk = true; } + + return shrunk; } /** @@ -1499,11 +1504,8 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, struct radix_tree_node *parent; if (node->count) { - if (node == indirect_to_ptr(root->rnode)) { - radix_tree_shrink(root); - if (root->height == 0) - deleted = true; - } + if (node == indirect_to_ptr(root->rnode)) + deleted |= radix_tree_shrink(root); return deleted; } -- cgit v0.10.2 From 0694f0c9e20c47063e4237e5f6649ae5ce5a369a Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:16 -0700 Subject: radix tree test suite: remove dependencies on height verify_node() can use node->shift instead of the height. tree_verify_min_height() can be converted over to using node_maxindex() and shift_maxindex() instead of radix_tree_maxindex(). 
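For reference, the arithmetic behind shift_maxindex()/node_maxindex() that the converted test relies on is small enough to check by hand; a minimal sketch, assuming the usual 64-slot nodes (RADIX_TREE_MAP_SHIFT == 6):

#include <assert.h>

#define MAP_SHIFT 6			/* assumed RADIX_TREE_MAP_SHIFT */
#define MAP_SIZE  (1UL << MAP_SHIFT)

/* Mirrors the shift_maxindex() introduced earlier in this series;
 * node_maxindex(node) is just shift_maxindex(node->shift). */
static unsigned long shift_maxindex(unsigned int shift)
{
	return (MAP_SIZE << shift) - 1;
}

int main(void)
{
	assert(shift_maxindex(0)  == 63);	/* one level: 64 slots */
	assert(shift_maxindex(6)  == 4095);	/* two levels */
	assert(shift_maxindex(12) == 262143);	/* three levels */
	return 0;
}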
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c index da54f11..3004c58 100644 --- a/tools/testing/radix-tree/test.c +++ b/tools/testing/radix-tree/test.c @@ -143,7 +143,7 @@ void item_full_scan(struct radix_tree_root *root, unsigned long start, } static int verify_node(struct radix_tree_node *slot, unsigned int tag, - unsigned int height, int tagged) + int tagged) { int anyset = 0; int i; @@ -159,7 +159,8 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag, } } if (tagged != anyset) { - printf("tag: %u, height %u, tagged: %d, anyset: %d\n", tag, height, tagged, anyset); + printf("tag: %u, shift %u, tagged: %d, anyset: %d\n", + tag, slot->shift, tagged, anyset); for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) { printf("tag %d: ", j); for (i = 0; i < RADIX_TREE_TAG_LONGS; i++) @@ -171,10 +172,10 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag, assert(tagged == anyset); /* Go for next level */ - if (height > 1) { + if (slot->shift > 0) { for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) if (slot->slots[i]) - if (verify_node(slot->slots[i], tag, height - 1, + if (verify_node(slot->slots[i], tag, !!test_bit(i, slot->tags[tag]))) { printf("Failure at off %d\n", i); for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) { @@ -191,9 +192,10 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag, void verify_tag_consistency(struct radix_tree_root *root, unsigned int tag) { - if (!root->height) + struct radix_tree_node *node = root->rnode; + if (!radix_tree_is_indirect_ptr(node)) return; - verify_node(root->rnode, tag, root->height, !!root_tag_get(root, tag)); + verify_node(node, tag, !!root_tag_get(root, tag)); } void item_kill_tree(struct radix_tree_root *root) @@ -218,9 +220,19 @@ void item_kill_tree(struct radix_tree_root *root) void tree_verify_min_height(struct radix_tree_root *root, int maxindex) { - assert(radix_tree_maxindex(root->height) >= maxindex); - if (root->height > 1) - assert(radix_tree_maxindex(root->height-1) < maxindex); - else if (root->height == 1) - assert(radix_tree_maxindex(root->height-1) <= maxindex); + unsigned shift; + struct radix_tree_node *node = root->rnode; + if (!radix_tree_is_indirect_ptr(node)) { + assert(maxindex == 0); + return; + } + + node = indirect_to_ptr(node); + assert(maxindex <= node_maxindex(node)); + + shift = node->shift; + if (shift > 0) + assert(maxindex > shift_maxindex(shift - RADIX_TREE_MAP_SHIFT)); + else + assert(maxindex > 0); } diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h index 67217c9..866c8c6 100644 --- a/tools/testing/radix-tree/test.h +++ b/tools/testing/radix-tree/test.h @@ -42,4 +42,5 @@ extern int nr_allocated; void *indirect_to_ptr(void *ptr); void radix_tree_dump(struct radix_tree_root *root); int root_tag_get(struct radix_tree_root *root, unsigned int tag); -unsigned long radix_tree_maxindex(unsigned int height); +unsigned long node_maxindex(struct radix_tree_node *); +unsigned long shift_maxindex(unsigned int shift); -- cgit v0.10.2 From d0891265bbc988dc91ed8580b38eb3dac128581b Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:19 -0700 Subject: radix-tree: remove root->height The only remaining references to root->height were in extend and shrink, where it was updated. Now we can remove it entirely. 
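Nothing is lost by dropping the field: if anything still wants the old height value, it can be recomputed from the root entry. A hypothetical helper, not part of the series, sketched using the conventions these patches establish:

/* Hypothetical, for illustration only: reconstruct the old root->height
 * from the root entry, now that struct radix_tree_root has no height
 * field.  Relies on the series' convention that a root node's shift is
 * (height - 1) * RADIX_TREE_MAP_SHIFT. */
static unsigned int tree_height(struct radix_tree_root *root)
{
	struct radix_tree_node *node;
	void *entry = root->rnode;

	if (!radix_tree_is_indirect_ptr(entry))
		return 0;	/* empty tree or a single direct entry */

	node = indirect_to_ptr(entry);
	return node->shift / RADIX_TREE_MAP_SHIFT + 1;
}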
Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 0374582..c0d223c 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -110,13 +110,11 @@ struct radix_tree_node { /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */ struct radix_tree_root { - unsigned int height; gfp_t gfp_mask; struct radix_tree_node __rcu *rnode; }; #define RADIX_TREE_INIT(mask) { \ - .height = 0, \ .gfp_mask = (mask), \ .rnode = NULL, \ } @@ -126,7 +124,6 @@ struct radix_tree_root { #define INIT_RADIX_TREE(root, mask) \ do { \ - (root)->height = 0; \ (root)->gfp_mask = (mask); \ (root)->rnode = NULL; \ } while (0) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 75c9e61..58f79fe 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -39,12 +39,6 @@ /* - * The height_to_maxindex array needs to be one deeper than the maximum - * path as height 0 holds only 1 entry. - */ -static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly; - -/* * Radix tree node cache. */ static struct kmem_cache *radix_tree_node_cachep; @@ -218,8 +212,7 @@ radix_tree_find_next_bit(const unsigned long *addr, } #ifndef __KERNEL__ -static void dump_node(struct radix_tree_node *node, - unsigned shift, unsigned long index) +static void dump_node(struct radix_tree_node *node, unsigned long index) { unsigned long i; @@ -229,8 +222,8 @@ static void dump_node(struct radix_tree_node *node, node->shift, node->count, node->parent); for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) { - unsigned long first = index | (i << shift); - unsigned long last = first | ((1UL << shift) - 1); + unsigned long first = index | (i << node->shift); + unsigned long last = first | ((1UL << node->shift) - 1); void *entry = node->slots[i]; if (!entry) continue; @@ -243,8 +236,7 @@ static void dump_node(struct radix_tree_node *node, pr_debug("radix entry %p offset %ld indices %ld-%ld\n", entry, i, first, last); } else { - dump_node(indirect_to_ptr(entry), - shift - RADIX_TREE_MAP_SHIFT, first); + dump_node(indirect_to_ptr(entry), first); } } } @@ -252,13 +244,12 @@ static void dump_node(struct radix_tree_node *node, /* For debug */ static void radix_tree_dump(struct radix_tree_root *root) { - pr_debug("radix root: %p height %d rnode %p tags %x\n", - root, root->height, root->rnode, + pr_debug("radix root: %p rnode %p tags %x\n", + root, root->rnode, root->gfp_mask >> __GFP_BITS_SHIFT); if (!radix_tree_is_indirect_ptr(root->rnode)) return; - dump_node(indirect_to_ptr(root->rnode), - (root->height - 1) * RADIX_TREE_MAP_SHIFT, 0); + dump_node(indirect_to_ptr(root->rnode), 0); } #endif @@ -411,14 +402,8 @@ int radix_tree_maybe_preload(gfp_t gfp_mask) EXPORT_SYMBOL(radix_tree_maybe_preload); /* - * Return the maximum key which can be store into a - * radix tree with height HEIGHT. + * The maximum index which can be stored in a radix tree */ -static inline unsigned long radix_tree_maxindex(unsigned int height) -{ - return height_to_maxindex[height]; -} - static inline unsigned long shift_maxindex(unsigned int shift) { return (RADIX_TREE_MAP_SIZE << shift) - 1; @@ -450,24 +435,22 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, * Extend a radix tree so it can store key @index. 
*/ static int radix_tree_extend(struct radix_tree_root *root, - unsigned long index) + unsigned long index, unsigned int shift) { struct radix_tree_node *slot; - unsigned int height; + unsigned int maxshift; int tag; - /* Figure out what the height should be. */ - height = root->height + 1; - while (index > radix_tree_maxindex(height)) - height++; + /* Figure out what the shift should be. */ + maxshift = shift; + while (index > shift_maxindex(maxshift)) + maxshift += RADIX_TREE_MAP_SHIFT; - if (root->rnode == NULL) { - root->height = height; + slot = root->rnode; + if (!slot) goto out; - } do { - unsigned int newheight; struct radix_tree_node *node = radix_tree_node_alloc(root); if (!node) @@ -479,14 +462,11 @@ static int radix_tree_extend(struct radix_tree_root *root, tag_set(node, tag, 0); } - /* Increase the height. */ - newheight = root->height; - BUG_ON(newheight > BITS_PER_LONG); - node->shift = newheight * RADIX_TREE_MAP_SHIFT; + BUG_ON(shift > BITS_PER_LONG); + node->shift = shift; node->offset = 0; node->count = 1; node->parent = NULL; - slot = root->rnode; if (radix_tree_is_indirect_ptr(slot)) { slot = indirect_to_ptr(slot); slot->parent = node; @@ -495,10 +475,11 @@ static int radix_tree_extend(struct radix_tree_root *root, node->slots[0] = slot; node = ptr_to_indirect(node); rcu_assign_pointer(root->rnode, node); - root->height = ++newheight; - } while (height > root->height); + shift += RADIX_TREE_MAP_SHIFT; + slot = node; + } while (shift <= maxshift); out: - return height * RADIX_TREE_MAP_SHIFT; + return maxshift + RADIX_TREE_MAP_SHIFT; } /** @@ -531,15 +512,13 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, /* Make sure the tree is high enough. */ if (max > maxindex) { - int error = radix_tree_extend(root, max); + int error = radix_tree_extend(root, max, shift); if (error < 0) return error; shift = error; slot = root->rnode; - if (order == shift) { + if (order == shift) shift += RADIX_TREE_MAP_SHIFT; - root->height++; - } } offset = 0; /* uninitialised var warning */ @@ -1412,32 +1391,32 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) #endif /* CONFIG_SHMEM && CONFIG_SWAP */ /** - * radix_tree_shrink - shrink height of a radix tree to minimal + * radix_tree_shrink - shrink radix tree to minimum height * @root radix tree root */ static inline bool radix_tree_shrink(struct radix_tree_root *root) { bool shrunk = false; - /* try to shrink tree height */ - while (root->height > 0) { + for (;;) { struct radix_tree_node *to_free = root->rnode; struct radix_tree_node *slot; - BUG_ON(!radix_tree_is_indirect_ptr(to_free)); + if (!radix_tree_is_indirect_ptr(to_free)) + break; to_free = indirect_to_ptr(to_free); /* * The candidate node has more than one child, or its child - * is not at the leftmost slot, or it is a multiorder entry, - * we cannot shrink. + * is not at the leftmost slot, or the child is a multiorder + * entry, we cannot shrink. */ if (to_free->count != 1) break; slot = to_free->slots[0]; if (!slot) break; - if (!radix_tree_is_indirect_ptr(slot) && (root->height > 1)) + if (!radix_tree_is_indirect_ptr(slot) && to_free->shift) break; if (radix_tree_is_indirect_ptr(slot)) { @@ -1454,7 +1433,6 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) * one (root->rnode) as far as dependent read barriers go. */ root->rnode = slot; - root->height--; /* * We have a dilemma here. 
The node's slot[0] must not be @@ -1515,7 +1493,6 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, parent->count--; } else { root_tag_clear_all(root); - root->height = 0; root->rnode = NULL; } @@ -1631,26 +1608,6 @@ radix_tree_node_ctor(void *arg) INIT_LIST_HEAD(&node->private_list); } -static __init unsigned long __maxindex(unsigned int height) -{ - unsigned int width = height * RADIX_TREE_MAP_SHIFT; - int shift = RADIX_TREE_INDEX_BITS - width; - - if (shift < 0) - return ~0UL; - if (shift >= BITS_PER_LONG) - return 0UL; - return ~0UL >> shift; -} - -static __init void radix_tree_init_maxindex(void) -{ - unsigned int i; - - for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++) - height_to_maxindex[i] = __maxindex(i); -} - static int radix_tree_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -1677,6 +1634,5 @@ void __init radix_tree_init(void) sizeof(struct radix_tree_node), 0, SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, radix_tree_node_ctor); - radix_tree_init_maxindex(); hotcpu_notifier(radix_tree_callback, 0); } -- cgit v0.10.2 From 30ff46ccb303fb6f6c28b9aa9f2cdc4ba900ed3f Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:22 -0700 Subject: radix-tree: rename INDIRECT_PTR to INTERNAL_NODE The name RADIX_TREE_INDIRECT_PTR doesn't really match the meaning. RADIX_TREE_INTERNAL_NODE is a better name. Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index c0d223c..c8cc879 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -29,20 +29,16 @@ #include /* - * An indirect pointer (root->rnode pointing to a radix_tree_node, rather - * than a data item) is signalled by the low bit set in the root->rnode - * pointer. - * - * In this case root->height is > 0, but the indirect pointer tests are - * needed for RCU lookups (because root->height is unreliable). The only - * time callers need worry about this is when doing a lookup_slot under - * RCU. - * - * Indirect pointer in fact is also used to tag the last pointer of a node - * when it is shrunk, before we rcu free the node. See shrink code for - * details. + * Entries in the radix tree have the low bit set if they refer to a + * radix_tree_node. If the low bit is clear then the entry is user data. + * + * We also use the low bit to indicate that the slot will be freed in the + * next RCU idle period, and users need to re-walk the tree to find the + * new slot for the index that they were looking for. See the comment in + * radix_tree_shrink() for details. 
*/ -#define RADIX_TREE_INDIRECT_PTR 1 +#define RADIX_TREE_INTERNAL_NODE 1 + /* * A common use of the radix tree is to store pointers to struct pages; * but shmem/tmpfs needs also to store swap entries in the same tree: @@ -63,7 +59,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr) { - return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR); + return (int)((unsigned long)ptr & RADIX_TREE_INTERNAL_NODE); } /*** radix-tree API starts here ***/ @@ -228,7 +224,7 @@ static inline void *radix_tree_deref_slot_protected(void **pslot, */ static inline int radix_tree_deref_retry(void *arg) { - return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR); + return unlikely(radix_tree_is_indirect_ptr(arg)); } /** @@ -250,7 +246,7 @@ static inline int radix_tree_exceptional_entry(void *arg) static inline int radix_tree_exception(void *arg) { return unlikely((unsigned long)arg & - (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY)); + (RADIX_TREE_INTERNAL_NODE | RADIX_TREE_EXCEPTIONAL_ENTRY)); } /** @@ -448,7 +444,7 @@ radix_tree_chunk_size(struct radix_tree_iter *iter) static inline void *indirect_to_ptr(void *ptr) { - return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR); + return (void *)((unsigned long)ptr & ~RADIX_TREE_INTERNAL_NODE); } /** diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 58f79fe..31d5929 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -68,7 +68,7 @@ static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; static inline void *ptr_to_indirect(void *ptr) { - return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR); + return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE); } #define RADIX_TREE_RETRY ptr_to_indirect(NULL) -- cgit v0.10.2 From a4db4dcea1b3990e8c5dc8a03d11f36a3c0c6d8b Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:24 -0700 Subject: radix-tree: rename ptr_to_indirect() to node_to_entry() ptr_to_indirect() was a bad name. What it really means is "Convert this pointer to a node into an entry suitable for storing in the radix tree". So node_to_entry() seemed like a better name. 
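The convention behind these names is plain low-bit pointer tagging; a self-contained userspace sketch (toy helper names, not the kernel's) of the round trip that node_to_entry() and its inverse perform with the RADIX_TREE_INTERNAL_NODE bit:

/* Userspace toy, illustrative only: the low-bit tagging that marks an
 * entry as an internal node.  The kernel's node_to_entry() sets the
 * RADIX_TREE_INTERNAL_NODE bit; the inverse helper clears it. */
#include <assert.h>
#include <stdint.h>

#define INTERNAL_NODE	1UL

static void *to_entry(void *node)
{
	return (void *)((uintptr_t)node | INTERNAL_NODE);
}

static void *to_node(void *entry)
{
	return (void *)((uintptr_t)entry & ~INTERNAL_NODE);
}

static int is_internal(void *entry)
{
	return (uintptr_t)entry & INTERNAL_NODE;
}

int main(void)
{
	long node;				/* stand-in for a radix_tree_node */
	void *entry = to_entry(&node);		/* what a slot actually stores */

	assert(is_internal(entry));		/* bit 0 marks internal entries */
	assert(to_node(entry) == (void *)&node);/* round trip recovers the node */
	assert(!is_internal((void *)&node));	/* data pointers keep bit 0 clear */
	return 0;
}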
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 31d5929..f66bb39 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -66,12 +66,12 @@ struct radix_tree_preload { }; static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, }; -static inline void *ptr_to_indirect(void *ptr) +static inline void *node_to_entry(void *ptr) { return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE); } -#define RADIX_TREE_RETRY ptr_to_indirect(NULL) +#define RADIX_TREE_RETRY node_to_entry(NULL) #ifdef CONFIG_RADIX_TREE_MULTIORDER /* Sibling slots point directly to another slot in the same node */ @@ -470,13 +470,12 @@ static int radix_tree_extend(struct radix_tree_root *root, if (radix_tree_is_indirect_ptr(slot)) { slot = indirect_to_ptr(slot); slot->parent = node; - slot = ptr_to_indirect(slot); + slot = node_to_entry(slot); } node->slots[0] = slot; - node = ptr_to_indirect(node); - rcu_assign_pointer(root->rnode, node); + slot = node_to_entry(node); + rcu_assign_pointer(root->rnode, slot); shift += RADIX_TREE_MAP_SHIFT; - slot = node; } while (shift <= maxshift); out: return maxshift + RADIX_TREE_MAP_SHIFT; @@ -534,11 +533,11 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, slot->parent = node; if (node) { rcu_assign_pointer(node->slots[offset], - ptr_to_indirect(slot)); + node_to_entry(slot)); node->count++; } else rcu_assign_pointer(root->rnode, - ptr_to_indirect(slot)); + node_to_entry(slot)); } else if (!radix_tree_is_indirect_ptr(slot)) break; @@ -553,7 +552,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, if (order > shift) { int i, n = 1 << (order - shift); offset = offset & ~(n - 1); - slot = ptr_to_indirect(&node->slots[offset]); + slot = node_to_entry(&node->slots[offset]); for (i = 0; i < n; i++) { if (node->slots[offset + i]) return -EEXIST; @@ -1422,7 +1421,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) if (radix_tree_is_indirect_ptr(slot)) { slot = indirect_to_ptr(slot); slot->parent = NULL; - slot = ptr_to_indirect(slot); + slot = node_to_entry(slot); } /* @@ -1563,7 +1562,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root, radix_tree_tag_clear(root, index, tag); } - delete_sibling_entries(node, ptr_to_indirect(slot), offset); + delete_sibling_entries(node, node_to_entry(slot), offset); node->slots[offset] = NULL; node->count--; -- cgit v0.10.2 From 4dd6c0987ca43d6544f4f0a3f86f6ea3bfc60fc1 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:27 -0700 Subject: radix-tree: rename indirect_to_ptr() to entry_to_node() Mirrors the earlier commit introducing node_to_entry(). Also change the type returned to be a struct radix_tree_node pointer. That lets us simplify a couple of places in the radix tree shrink & extend paths where we could convert an entry into a pointer, modify the node, then convert the pointer back into an entry. 
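Concretely, the simplification the changelog refers to is the unwrap-modify-rewrap dance in the extend and shrink paths collapsing into a single statement; an abbreviated before/after taken from the extend-path hunk below:

/* Before this patch: convert the entry to a node pointer, modify it,
 * then convert back before storing. */
if (radix_tree_is_indirect_ptr(slot)) {
	slot = indirect_to_ptr(slot);
	slot->parent = node;
	slot = node_to_entry(slot);
}

/* After: entry_to_node() returns struct radix_tree_node *, so the
 * re-parenting is one line and the stored entry is left untouched. */
if (radix_tree_is_indirect_ptr(slot))
	entry_to_node(slot)->parent = node;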
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index c8cc879..b94aa19 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -442,7 +442,7 @@ radix_tree_chunk_size(struct radix_tree_iter *iter) return (iter->next_index - iter->index) >> iter_shift(iter); } -static inline void *indirect_to_ptr(void *ptr) +static inline struct radix_tree_node *entry_to_node(void *ptr) { return (void *)((unsigned long)ptr & ~RADIX_TREE_INTERNAL_NODE); } @@ -469,7 +469,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) return NULL; while (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && radix_tree_is_indirect_ptr(slot[1])) { - if (indirect_to_ptr(slot[1]) == canon) { + if (entry_to_node(slot[1]) == canon) { iter->tags >>= 1; iter->index = __radix_tree_iter_add(iter, 1); slot++; @@ -499,12 +499,10 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) if (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && radix_tree_is_indirect_ptr(*slot)) { - if (indirect_to_ptr(*slot) == canon) + if (entry_to_node(*slot) == canon) continue; - else { - iter->next_index = iter->index; - break; - } + iter->next_index = iter->index; + break; } if (likely(*slot)) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index f66bb39..3c3fdd9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -230,13 +230,13 @@ static void dump_node(struct radix_tree_node *node, unsigned long index) if (is_sibling_entry(node, entry)) { pr_debug("radix sblng %p offset %ld val %p indices %ld-%ld\n", entry, i, - *(void **)indirect_to_ptr(entry), + *(void **)entry_to_node(entry), first, last); } else if (!radix_tree_is_indirect_ptr(entry)) { pr_debug("radix entry %p offset %ld indices %ld-%ld\n", entry, i, first, last); } else { - dump_node(indirect_to_ptr(entry), first); + dump_node(entry_to_node(entry), first); } } } @@ -249,7 +249,7 @@ static void radix_tree_dump(struct radix_tree_root *root) root->gfp_mask >> __GFP_BITS_SHIFT); if (!radix_tree_is_indirect_ptr(root->rnode)) return; - dump_node(indirect_to_ptr(root->rnode), 0); + dump_node(entry_to_node(root->rnode), 0); } #endif @@ -422,7 +422,7 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, *nodep = node; if (likely(radix_tree_is_indirect_ptr(node))) { - node = indirect_to_ptr(node); + node = entry_to_node(node); *maxindex = node_maxindex(node); return node->shift + RADIX_TREE_MAP_SHIFT; } @@ -467,11 +467,8 @@ static int radix_tree_extend(struct radix_tree_root *root, node->offset = 0; node->count = 1; node->parent = NULL; - if (radix_tree_is_indirect_ptr(slot)) { - slot = indirect_to_ptr(slot); - slot->parent = node; - slot = node_to_entry(slot); - } + if (radix_tree_is_indirect_ptr(slot)) + entry_to_node(slot)->parent = node; node->slots[0] = slot; slot = node_to_entry(node); rcu_assign_pointer(root->rnode, slot); @@ -542,7 +539,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, break; /* Go a level down */ - node = indirect_to_ptr(slot); + node = entry_to_node(slot); offset = (index >> shift) & RADIX_TREE_MAP_MASK; offset = radix_tree_descend(node, &slot, offset); } @@ -645,7 +642,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index, if (node == RADIX_TREE_RETRY) goto restart; - parent = indirect_to_ptr(node); + parent = entry_to_node(node); shift -= 
RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; offset = radix_tree_descend(parent, &node, offset); @@ -729,7 +726,7 @@ void *radix_tree_tag_set(struct radix_tree_root *root, shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - parent = indirect_to_ptr(node); + parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, offset); BUG_ON(!node); @@ -777,7 +774,7 @@ void *radix_tree_tag_clear(struct radix_tree_root *root, shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - parent = indirect_to_ptr(node); + parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, offset); } @@ -844,7 +841,7 @@ int radix_tree_tag_get(struct radix_tree_root *root, shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - parent = indirect_to_ptr(node); + parent = entry_to_node(node); offset = radix_tree_descend(parent, &node, offset); if (!node) @@ -904,7 +901,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, return NULL; if (radix_tree_is_indirect_ptr(rnode)) { - rnode = indirect_to_ptr(rnode); + rnode = entry_to_node(rnode); } else if (rnode) { /* Single-slot tree */ iter->index = index; @@ -963,7 +960,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, if (!radix_tree_is_indirect_ptr(slot)) break; - node = indirect_to_ptr(slot); + node = entry_to_node(slot); shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; } @@ -1048,7 +1045,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, return 1; } - node = indirect_to_ptr(slot); + node = entry_to_node(slot); shift -= RADIX_TREE_MAP_SHIFT; for (;;) { @@ -1063,7 +1060,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, goto next; /* Sibling slots never have tags set on them */ if (radix_tree_is_indirect_ptr(slot)) { - node = indirect_to_ptr(slot); + node = entry_to_node(slot); shift -= RADIX_TREE_MAP_SHIFT; continue; } @@ -1322,7 +1319,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item, } continue; } - node = indirect_to_ptr(node); + node = entry_to_node(node); if (is_sibling_entry(slot, node)) continue; slot = node; @@ -1367,7 +1364,7 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) break; } - node = indirect_to_ptr(node); + node = entry_to_node(node); max_index = node_maxindex(node); if (cur_index > max_index) { @@ -1403,7 +1400,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) if (!radix_tree_is_indirect_ptr(to_free)) break; - to_free = indirect_to_ptr(to_free); + to_free = entry_to_node(to_free); /* * The candidate node has more than one child, or its child @@ -1418,11 +1415,8 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) if (!radix_tree_is_indirect_ptr(slot) && to_free->shift) break; - if (radix_tree_is_indirect_ptr(slot)) { - slot = indirect_to_ptr(slot); - slot->parent = NULL; - slot = node_to_entry(slot); - } + if (radix_tree_is_indirect_ptr(slot)) + entry_to_node(slot)->parent = NULL; /* * We don't need rcu_assign_pointer(), since we are simply @@ -1481,7 +1475,7 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, struct radix_tree_node *parent; if (node->count) { - if (node == indirect_to_ptr(root->rnode)) + if (node == entry_to_node(root->rnode)) deleted |= radix_tree_shrink(root); return deleted; } diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c index 3004c58..7b0bc1f 
100644 --- a/tools/testing/radix-tree/test.c +++ b/tools/testing/radix-tree/test.c @@ -149,7 +149,7 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag, int i; int j; - slot = indirect_to_ptr(slot); + slot = entry_to_node(slot); /* Verify consistency at this level */ for (i = 0; i < RADIX_TREE_TAG_LONGS; i++) { @@ -227,7 +227,7 @@ void tree_verify_min_height(struct radix_tree_root *root, int maxindex) return; } - node = indirect_to_ptr(node); + node = entry_to_node(node); assert(maxindex <= node_maxindex(node)); shift = node->shift; diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h index 866c8c6..e851313 100644 --- a/tools/testing/radix-tree/test.h +++ b/tools/testing/radix-tree/test.h @@ -39,7 +39,6 @@ void verify_tag_consistency(struct radix_tree_root *root, unsigned int tag); extern int nr_allocated; /* Normally private parts of lib/radix-tree.c */ -void *indirect_to_ptr(void *ptr); void radix_tree_dump(struct radix_tree_root *root); int root_tag_get(struct radix_tree_root *root, unsigned int tag); unsigned long node_maxindex(struct radix_tree_node *); -- cgit v0.10.2 From b194d16c27af905d6e3552f4851bc7d9fee4e90f Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:30 -0700 Subject: radix-tree: rename radix_tree_is_indirect_ptr() As with indirect_to_ptr(), ptr_to_indirect() and RADIX_TREE_INDIRECT_PTR, change radix_tree_is_indirect_ptr() to radix_tree_is_internal_node(). Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index b94aa19..bad6310 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -57,7 +57,7 @@ #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \ RADIX_DAX_SHIFT | (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) -static inline int radix_tree_is_indirect_ptr(void *ptr) +static inline int radix_tree_is_internal_node(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INTERNAL_NODE); } @@ -224,7 +224,7 @@ static inline void *radix_tree_deref_slot_protected(void **pslot, */ static inline int radix_tree_deref_retry(void *arg) { - return unlikely(radix_tree_is_indirect_ptr(arg)); + return unlikely(radix_tree_is_internal_node(arg)); } /** @@ -259,7 +259,7 @@ static inline int radix_tree_exception(void *arg) */ static inline void radix_tree_replace_slot(void **pslot, void *item) { - BUG_ON(radix_tree_is_indirect_ptr(item)); + BUG_ON(radix_tree_is_internal_node(item)); rcu_assign_pointer(*pslot, item); } @@ -468,7 +468,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) if (unlikely(!iter->tags)) return NULL; while (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && - radix_tree_is_indirect_ptr(slot[1])) { + radix_tree_is_internal_node(slot[1])) { if (entry_to_node(slot[1]) == canon) { iter->tags >>= 1; iter->index = __radix_tree_iter_add(iter, 1); @@ -498,7 +498,7 @@ radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags) iter->index = __radix_tree_iter_add(iter, 1); if (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) && - radix_tree_is_indirect_ptr(*slot)) { + radix_tree_is_internal_node(*slot)) { if (entry_to_node(*slot) == canon) continue; iter->next_index = iter->index; diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 3c3fdd9..b65c830 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -100,7 +100,7 @@ static unsigned radix_tree_descend(struct radix_tree_node *parent, void **entry = rcu_dereference_raw(parent->slots[offset]); #ifdef CONFIG_RADIX_TREE_MULTIORDER - if (radix_tree_is_indirect_ptr(entry)) { + if (radix_tree_is_internal_node(entry)) { unsigned long siboff = get_slot_offset(parent, entry); if (siboff < RADIX_TREE_MAP_SIZE) { offset = siboff; @@ -232,7 +232,7 @@ static void dump_node(struct radix_tree_node *node, unsigned long index) entry, i, *(void **)entry_to_node(entry), first, last); - } else if (!radix_tree_is_indirect_ptr(entry)) { + } else if (!radix_tree_is_internal_node(entry)) { pr_debug("radix entry %p offset %ld indices %ld-%ld\n", entry, i, first, last); } else { @@ -247,7 +247,7 @@ static void radix_tree_dump(struct radix_tree_root *root) pr_debug("radix root: %p rnode %p tags %x\n", root, root->rnode, root->gfp_mask >> __GFP_BITS_SHIFT); - if (!radix_tree_is_indirect_ptr(root->rnode)) + if (!radix_tree_is_internal_node(root->rnode)) return; dump_node(entry_to_node(root->rnode), 0); } @@ -302,7 +302,7 @@ radix_tree_node_alloc(struct radix_tree_root *root) ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask | __GFP_ACCOUNT); out: - BUG_ON(radix_tree_is_indirect_ptr(ret)); + BUG_ON(radix_tree_is_internal_node(ret)); return ret; } @@ -421,7 +421,7 @@ static unsigned radix_tree_load_root(struct radix_tree_root *root, *nodep = node; - if (likely(radix_tree_is_indirect_ptr(node))) { + if (likely(radix_tree_is_internal_node(node))) { node = entry_to_node(node); *maxindex = node_maxindex(node); return node->shift + RADIX_TREE_MAP_SHIFT; @@ -467,7 +467,7 @@ static int radix_tree_extend(struct radix_tree_root *root, node->offset = 0; node->count = 1; node->parent = NULL; - if (radix_tree_is_indirect_ptr(slot)) + if (radix_tree_is_internal_node(slot)) entry_to_node(slot)->parent = node; node->slots[0] = slot; slot = node_to_entry(node); @@ -535,7 +535,7 @@ int __radix_tree_create(struct 
radix_tree_root *root, unsigned long index, } else rcu_assign_pointer(root->rnode, node_to_entry(slot)); - } else if (!radix_tree_is_indirect_ptr(slot)) + } else if (!radix_tree_is_internal_node(slot)) break; /* Go a level down */ @@ -585,7 +585,7 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index, void **slot; int error; - BUG_ON(radix_tree_is_indirect_ptr(item)); + BUG_ON(radix_tree_is_internal_node(item)); error = __radix_tree_create(root, index, order, &node, &slot); if (error) @@ -637,7 +637,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index, if (index > maxindex) return NULL; - while (radix_tree_is_indirect_ptr(node)) { + while (radix_tree_is_internal_node(node)) { unsigned offset; if (node == RADIX_TREE_RETRY) @@ -720,7 +720,7 @@ void *radix_tree_tag_set(struct radix_tree_root *root, shift = radix_tree_load_root(root, &node, &maxindex); BUG_ON(index > maxindex); - while (radix_tree_is_indirect_ptr(node)) { + while (radix_tree_is_internal_node(node)) { unsigned offset; shift -= RADIX_TREE_MAP_SHIFT; @@ -770,7 +770,7 @@ void *radix_tree_tag_clear(struct radix_tree_root *root, parent = NULL; - while (radix_tree_is_indirect_ptr(node)) { + while (radix_tree_is_internal_node(node)) { shift -= RADIX_TREE_MAP_SHIFT; offset = (index >> shift) & RADIX_TREE_MAP_MASK; @@ -835,7 +835,7 @@ int radix_tree_tag_get(struct radix_tree_root *root, if (node == NULL) return 0; - while (radix_tree_is_indirect_ptr(node)) { + while (radix_tree_is_internal_node(node)) { int offset; shift -= RADIX_TREE_MAP_SHIFT; @@ -900,7 +900,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, if (index > maxindex) return NULL; - if (radix_tree_is_indirect_ptr(rnode)) { + if (radix_tree_is_internal_node(rnode)) { rnode = entry_to_node(rnode); } else if (rnode) { /* Single-slot tree */ @@ -957,7 +957,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, if ((slot == NULL) || (slot == RADIX_TREE_RETRY)) goto restart; - if (!radix_tree_is_indirect_ptr(slot)) + if (!radix_tree_is_internal_node(slot)) break; node = entry_to_node(slot); @@ -1039,7 +1039,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, *first_indexp = last_index + 1; return 0; } - if (!radix_tree_is_indirect_ptr(slot)) { + if (!radix_tree_is_internal_node(slot)) { *first_indexp = last_index + 1; root_tag_set(root, settag); return 1; @@ -1059,7 +1059,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, if (!tag_get(node, iftag, offset)) goto next; /* Sibling slots never have tags set on them */ - if (radix_tree_is_indirect_ptr(slot)) { + if (radix_tree_is_internal_node(slot)) { node = entry_to_node(slot); shift -= RADIX_TREE_MAP_SHIFT; continue; @@ -1152,7 +1152,7 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results, results[ret] = rcu_dereference_raw(*slot); if (!results[ret]) continue; - if (radix_tree_is_indirect_ptr(results[ret])) { + if (radix_tree_is_internal_node(results[ret])) { slot = radix_tree_iter_retry(&iter); continue; } @@ -1235,7 +1235,7 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, results[ret] = rcu_dereference_raw(*slot); if (!results[ret]) continue; - if (radix_tree_is_indirect_ptr(results[ret])) { + if (radix_tree_is_internal_node(results[ret])) { slot = radix_tree_iter_retry(&iter); continue; } @@ -1311,7 +1311,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item, rcu_dereference_raw(slot->slots[i]); if (node == RADIX_TREE_RETRY) goto out; 
- if (!radix_tree_is_indirect_ptr(node)) { + if (!radix_tree_is_internal_node(node)) { if (node == item) { info->found_index = index; info->stop = true; @@ -1357,7 +1357,7 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item) do { rcu_read_lock(); node = rcu_dereference_raw(root->rnode); - if (!radix_tree_is_indirect_ptr(node)) { + if (!radix_tree_is_internal_node(node)) { rcu_read_unlock(); if (node == item) info.found_index = 0; @@ -1398,7 +1398,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) struct radix_tree_node *to_free = root->rnode; struct radix_tree_node *slot; - if (!radix_tree_is_indirect_ptr(to_free)) + if (!radix_tree_is_internal_node(to_free)) break; to_free = entry_to_node(to_free); @@ -1412,10 +1412,10 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) slot = to_free->slots[0]; if (!slot) break; - if (!radix_tree_is_indirect_ptr(slot) && to_free->shift) + if (!radix_tree_is_internal_node(slot) && to_free->shift) break; - if (radix_tree_is_indirect_ptr(slot)) + if (radix_tree_is_internal_node(slot)) entry_to_node(slot)->parent = NULL; /* @@ -1445,7 +1445,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) * also results in a stale slot). So tag the slot as indirect * to force callers to retry. */ - if (!radix_tree_is_indirect_ptr(slot)) + if (!radix_tree_is_internal_node(slot)) to_free->slots[0] = RADIX_TREE_RETRY; radix_tree_node_free(to_free); diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c index 7b0bc1f..a6e8099 100644 --- a/tools/testing/radix-tree/test.c +++ b/tools/testing/radix-tree/test.c @@ -193,7 +193,7 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag, void verify_tag_consistency(struct radix_tree_root *root, unsigned int tag) { struct radix_tree_node *node = root->rnode; - if (!radix_tree_is_indirect_ptr(node)) + if (!radix_tree_is_internal_node(node)) return; verify_node(node, tag, !!root_tag_get(root, tag)); } @@ -222,7 +222,7 @@ void tree_verify_min_height(struct radix_tree_root *root, int maxindex) { unsigned shift; struct radix_tree_node *node = root->rnode; - if (!radix_tree_is_indirect_ptr(node)) { + if (!radix_tree_is_internal_node(node)) { assert(maxindex == 0); return; } -- cgit v0.10.2 From af49a63e101eb62376cc1d6bd25b97eb8c691d54 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:33 -0700 Subject: radix-tree: change naming conventions in radix_tree_shrink Use the more standard 'node' and 'child' instead of 'to_free' and 'slot'. Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index b65c830..4b4a2a2 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1395,37 +1395,37 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) bool shrunk = false; for (;;) { - struct radix_tree_node *to_free = root->rnode; - struct radix_tree_node *slot; + struct radix_tree_node *node = root->rnode; + struct radix_tree_node *child; - if (!radix_tree_is_internal_node(to_free)) + if (!radix_tree_is_internal_node(node)) break; - to_free = entry_to_node(to_free); + node = entry_to_node(node); /* * The candidate node has more than one child, or its child * is not at the leftmost slot, or the child is a multiorder * entry, we cannot shrink. 
*/ - if (to_free->count != 1) + if (node->count != 1) break; - slot = to_free->slots[0]; - if (!slot) + child = node->slots[0]; + if (!child) break; - if (!radix_tree_is_internal_node(slot) && to_free->shift) + if (!radix_tree_is_internal_node(child) && node->shift) break; - if (radix_tree_is_internal_node(slot)) - entry_to_node(slot)->parent = NULL; + if (radix_tree_is_internal_node(child)) + entry_to_node(child)->parent = NULL; /* * We don't need rcu_assign_pointer(), since we are simply * moving the node from one part of the tree to another: if it * was safe to dereference the old pointer to it - * (to_free->slots[0]), it will be safe to dereference the new + * (node->slots[0]), it will be safe to dereference the new * one (root->rnode) as far as dependent read barriers go. */ - root->rnode = slot; + root->rnode = child; /* * We have a dilemma here. The node's slot[0] must not be @@ -1445,10 +1445,10 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root) * also results in a stale slot). So tag the slot as indirect * to force callers to retry. */ - if (!radix_tree_is_internal_node(slot)) - to_free->slots[0] = RADIX_TREE_RETRY; + if (!radix_tree_is_internal_node(child)) + node->slots[0] = RADIX_TREE_RETRY; - radix_tree_node_free(to_free); + radix_tree_node_free(node); shrunk = true; } -- cgit v0.10.2 From 8c1244de00ef98f73e21eecc42d84b2742fbb4f9 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:36 -0700 Subject: radix-tree: tidy up next_chunk Convert radix_tree_next_chunk to use 'child' instead of 'slot' as the name of the child node. Also use node_maxindex() where it makes sense. The 'rnode' variable was unnecessary; it doesn't overlap in usage with 'node', so we can just use 'node' the whole way through the function. Improve the testcase to start the walk from every index in the carefully constructed tree, and to accept any index within the range covered by the entry. 
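The node_maxindex() substitution mentioned above replaces open-coded shift/mask arithmetic with the range the current node covers; a small worked example, assuming 64-slot nodes, of the iter->index / iter->next_index computation that the hunk below introduces:

#include <assert.h>

#define MAP_SHIFT 6			/* assumed RADIX_TREE_MAP_SHIFT */
#define MAP_SIZE  (1UL << MAP_SHIFT)

/* node_maxindex(node) is shift_maxindex(node->shift) in the real code. */
static unsigned long maxindex(unsigned int shift)
{
	return (MAP_SIZE << shift) - 1;
}

int main(void)
{
	unsigned long index = 130;		/* the walk starts here */
	unsigned int shift = 0;			/* descent ended in a leaf node */
	unsigned long offset = (index >> shift) & (MAP_SIZE - 1);

	/* same arithmetic as the new iterator-state lines in next_chunk */
	unsigned long iter_index = (index & ~maxindex(shift)) | (offset << shift);
	unsigned long next_index = (index | maxindex(shift)) + 1;

	assert(iter_index == 130);
	assert(next_index == 192);		/* chunk covers slots [128, 191] */
	return 0;
}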
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 4b4a2a2..c42867a 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -876,7 +876,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, struct radix_tree_iter *iter, unsigned flags) { unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK; - struct radix_tree_node *rnode, *node; + struct radix_tree_node *node, *child; unsigned long index, offset, maxindex; if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag)) @@ -896,38 +896,29 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, return NULL; restart: - shift = radix_tree_load_root(root, &rnode, &maxindex); + shift = radix_tree_load_root(root, &child, &maxindex); if (index > maxindex) return NULL; + if (!child) + return NULL; - if (radix_tree_is_internal_node(rnode)) { - rnode = entry_to_node(rnode); - } else if (rnode) { + if (!radix_tree_is_internal_node(child)) { /* Single-slot tree */ iter->index = index; iter->next_index = maxindex + 1; iter->tags = 1; - __set_iter_shift(iter, shift); + __set_iter_shift(iter, 0); return (void **)&root->rnode; - } else - return NULL; - - shift -= RADIX_TREE_MAP_SHIFT; - offset = index >> shift; - - node = rnode; - while (1) { - struct radix_tree_node *slot; - unsigned new_off = radix_tree_descend(node, &slot, offset); + } - if (new_off < offset) { - offset = new_off; - index &= ~((RADIX_TREE_MAP_SIZE << shift) - 1); - index |= offset << shift; - } + do { + node = entry_to_node(child); + shift -= RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = radix_tree_descend(node, &child, offset); if ((flags & RADIX_TREE_ITER_TAGGED) ? 
- !tag_get(node, tag, offset) : !slot) { + !tag_get(node, tag, offset) : !child) { /* Hole detected */ if (flags & RADIX_TREE_ITER_CONTIG) return NULL; @@ -945,29 +936,23 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, if (slot) break; } - index &= ~((RADIX_TREE_MAP_SIZE << shift) - 1); + index &= ~node_maxindex(node); index += offset << shift; /* Overflow after ~0UL */ if (!index) return NULL; if (offset == RADIX_TREE_MAP_SIZE) goto restart; - slot = rcu_dereference_raw(node->slots[offset]); + child = rcu_dereference_raw(node->slots[offset]); } - if ((slot == NULL) || (slot == RADIX_TREE_RETRY)) + if ((child == NULL) || (child == RADIX_TREE_RETRY)) goto restart; - if (!radix_tree_is_internal_node(slot)) - break; - - node = entry_to_node(slot); - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - } + } while (radix_tree_is_internal_node(child)); /* Update the iterator state */ - iter->index = index & ~((1 << shift) - 1); - iter->next_index = (index | ((RADIX_TREE_MAP_SIZE << shift) - 1)) + 1; + iter->index = (index &~ node_maxindex(node)) | (offset << node->shift); + iter->next_index = (index | node_maxindex(node)) + 1; __set_iter_shift(iter, shift); /* Construct iter->tags bit-mask from node->tags[tag] array */ diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c index c061f4b..39d9b95 100644 --- a/tools/testing/radix-tree/multiorder.c +++ b/tools/testing/radix-tree/multiorder.c @@ -202,7 +202,7 @@ void multiorder_iteration(void) RADIX_TREE(tree, GFP_KERNEL); struct radix_tree_iter iter; void **slot; - int i, err; + int i, j, err; printf("Multiorder iteration test\n"); @@ -215,29 +215,21 @@ void multiorder_iteration(void) assert(!err); } - i = 0; - /* start from index 1 to verify we find the multi-order entry at 0 */ - radix_tree_for_each_slot(slot, &tree, &iter, 1) { - int height = order[i] / RADIX_TREE_MAP_SHIFT; - int shift = height * RADIX_TREE_MAP_SHIFT; - - assert(iter.index == index[i]); - assert(iter.shift == shift); - i++; - } - - /* - * Now iterate through the tree starting at an elevated multi-order - * entry, beginning at an index in the middle of the range. - */ - i = 8; - radix_tree_for_each_slot(slot, &tree, &iter, 70) { - int height = order[i] / RADIX_TREE_MAP_SHIFT; - int shift = height * RADIX_TREE_MAP_SHIFT; - - assert(iter.index == index[i]); - assert(iter.shift == shift); - i++; + for (j = 0; j < 256; j++) { + for (i = 0; i < NUM_ENTRIES; i++) + if (j <= (index[i] | ((1 << order[i]) - 1))) + break; + + radix_tree_for_each_slot(slot, &tree, &iter, j) { + int height = order[i] / RADIX_TREE_MAP_SHIFT; + int shift = height * RADIX_TREE_MAP_SHIFT; + int mask = (1 << order[i]) - 1; + + assert(iter.index >= (index[i] &~ mask)); + assert(iter.index <= (index[i] | mask)); + assert(iter.shift == shift); + i++; + } } item_kill_tree(&tree); @@ -249,7 +241,7 @@ void multiorder_tagged_iteration(void) struct radix_tree_iter iter; void **slot; unsigned long first = 0; - int i; + int i, j; printf("Multiorder tagged iteration test\n"); @@ -268,30 +260,49 @@ void multiorder_tagged_iteration(void) for (i = 0; i < TAG_ENTRIES; i++) assert(radix_tree_tag_set(&tree, tag_index[i], 1)); - i = 0; - /* start from index 1 to verify we find the multi-order entry at 0 */ - radix_tree_for_each_tagged(slot, &tree, &iter, 1, 1) { - assert(iter.index == tag_index[i]); - i++; - } - - /* - * Now iterate through the tree starting at an elevated multi-order - * entry, beginning at an index in the middle of the range. 
- */ - i = 4; - radix_tree_for_each_slot(slot, &tree, &iter, 70) { - assert(iter.index == tag_index[i]); - i++; + for (j = 0; j < 256; j++) { + int mask, k; + + for (i = 0; i < TAG_ENTRIES; i++) { + for (k = i; index[k] < tag_index[i]; k++) + ; + if (j <= (index[k] | ((1 << order[k]) - 1))) + break; + } + + radix_tree_for_each_tagged(slot, &tree, &iter, j, 1) { + for (k = i; index[k] < tag_index[i]; k++) + ; + mask = (1 << order[k]) - 1; + + assert(iter.index >= (tag_index[i] &~ mask)); + assert(iter.index <= (tag_index[i] | mask)); + i++; + } } radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, MT_NUM_ENTRIES, 1, 2); - i = 0; - radix_tree_for_each_tagged(slot, &tree, &iter, 1, 2) { - assert(iter.index == tag_index[i]); - i++; + for (j = 0; j < 256; j++) { + int mask, k; + + for (i = 0; i < TAG_ENTRIES; i++) { + for (k = i; index[k] < tag_index[i]; k++) + ; + if (j <= (index[k] | ((1 << order[k]) - 1))) + break; + } + + radix_tree_for_each_tagged(slot, &tree, &iter, j, 2) { + for (k = i; index[k] < tag_index[i]; k++) + ; + mask = (1 << order[k]) - 1; + + assert(iter.index >= (tag_index[i] &~ mask)); + assert(iter.index <= (tag_index[i] | mask)); + i++; + } } first = 1; -- cgit v0.10.2 From a8e4da25d3c573a0c3cf2fb33e91ec5cad8d7f16 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:39 -0700 Subject: radix-tree: tidy up range_tag_if_tagged Convert radix_tree_range_tag_if_tagged to name the nodes parent, node and child instead of node & slot. Use parent->offset instead of playing games with 'upindex'. Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index c42867a..1a82066 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -1009,9 +1009,9 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, unsigned long nr_to_tag, unsigned int iftag, unsigned int settag) { - struct radix_tree_node *slot, *node = NULL; + struct radix_tree_node *parent, *node, *child; unsigned long maxindex; - unsigned int shift = radix_tree_load_root(root, &slot, &maxindex); + unsigned int shift = radix_tree_load_root(root, &child, &maxindex); unsigned long tagged = 0; unsigned long index = *first_indexp; @@ -1024,28 +1024,25 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, *first_indexp = last_index + 1; return 0; } - if (!radix_tree_is_internal_node(slot)) { + if (!radix_tree_is_internal_node(child)) { *first_indexp = last_index + 1; root_tag_set(root, settag); return 1; } - node = entry_to_node(slot); + node = entry_to_node(child); shift -= RADIX_TREE_MAP_SHIFT; for (;;) { - unsigned long upindex; - unsigned offset; - - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(node, &slot, offset); - if (!slot) + unsigned offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = radix_tree_descend(node, &child, offset); + if (!child) goto next; if (!tag_get(node, iftag, offset)) goto next; /* Sibling slots never have tags set on them */ - if (radix_tree_is_internal_node(slot)) { - node = entry_to_node(slot); + if (radix_tree_is_internal_node(child)) { + node = entry_to_node(child); shift -= RADIX_TREE_MAP_SHIFT; continue; } @@ -1054,20 +1051,18 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, tagged++; tag_set(node, settag, offset); - slot = node->parent; /* walk back up the path tagging interior nodes */ - upindex 
= index >> shift; - while (slot) { - upindex >>= RADIX_TREE_MAP_SHIFT; - offset = upindex & RADIX_TREE_MAP_MASK; - + parent = node; + for (;;) { + offset = parent->offset; + parent = parent->parent; + if (!parent) + break; /* stop if we find a node with the tag already set */ - if (tag_get(slot, settag, offset)) + if (tag_get(parent, settag, offset)) break; - tag_set(slot, settag, offset); - slot = slot->parent; + tag_set(parent, settag, offset); } - next: /* Go to next item at level determined by 'shift' */ index = ((index >> shift) + 1) << shift; -- cgit v0.10.2 From 89148aa40201def3fa552f9d07dd99740d880ab2 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:42 -0700 Subject: radix-tree: tidy up __radix_tree_create() 1. Rename the existing variable 'slot' to 'child'. 2. Introduce a new variable called 'slot' which is the address of the slot we're dealing with. This lets us simplify the tree insertion, and removes the recalculation of 'slot' at the end of the function. 3. Using 'slot' in the sibling pointer insertion part makes the code more readable. Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 1a82066..9d9b4b9 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -499,12 +499,13 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, unsigned order, struct radix_tree_node **nodep, void ***slotp) { - struct radix_tree_node *node = NULL, *slot; + struct radix_tree_node *node = NULL, *child; + void **slot = (void **)&root->rnode; unsigned long maxindex; - unsigned int shift, offset; + unsigned int shift, offset = 0; unsigned long max = index | ((1UL << order) - 1); - shift = radix_tree_load_root(root, &slot, &maxindex); + shift = radix_tree_load_root(root, &child, &maxindex); /* Make sure the tree is high enough. */ if (max > maxindex) { @@ -512,51 +513,48 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, if (error < 0) return error; shift = error; - slot = root->rnode; + child = root->rnode; if (order == shift) shift += RADIX_TREE_MAP_SHIFT; } - offset = 0; /* uninitialised var warning */ while (shift > order) { shift -= RADIX_TREE_MAP_SHIFT; - if (slot == NULL) { + if (child == NULL) { /* Have to add a child node. 
*/ - slot = radix_tree_node_alloc(root); - if (!slot) + child = radix_tree_node_alloc(root); + if (!child) return -ENOMEM; - slot->shift = shift; - slot->offset = offset; - slot->parent = node; - if (node) { - rcu_assign_pointer(node->slots[offset], - node_to_entry(slot)); + child->shift = shift; + child->offset = offset; + child->parent = node; + rcu_assign_pointer(*slot, node_to_entry(child)); + if (node) node->count++; - } else - rcu_assign_pointer(root->rnode, - node_to_entry(slot)); - } else if (!radix_tree_is_internal_node(slot)) + } else if (!radix_tree_is_internal_node(child)) break; /* Go a level down */ - node = entry_to_node(slot); + node = entry_to_node(child); offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(node, &slot, offset); + offset = radix_tree_descend(node, &child, offset); + slot = &node->slots[offset]; } #ifdef CONFIG_RADIX_TREE_MULTIORDER /* Insert pointers to the canonical entry */ if (order > shift) { - int i, n = 1 << (order - shift); + unsigned i, n = 1 << (order - shift); offset = offset & ~(n - 1); - slot = node_to_entry(&node->slots[offset]); + slot = &node->slots[offset]; + child = node_to_entry(slot); for (i = 0; i < n; i++) { - if (node->slots[offset + i]) + if (slot[i]) return -EEXIST; } for (i = 1; i < n; i++) { - rcu_assign_pointer(node->slots[offset + i], slot); + rcu_assign_pointer(slot[i], child); node->count++; } } @@ -565,7 +563,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, if (nodep) *nodep = node; if (slotp) - *slotp = node ? node->slots + offset : (void **)&root->rnode; + *slotp = slot; return 0; } -- cgit v0.10.2 From d604c324524bf61c68182bb27db64656a78fe911 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:45 -0700 Subject: radix-tree: introduce radix_tree_replace_clear_tags() In addition to replacing the entry, we also clear all associated tags. This is really a one-off special for page_cache_tree_delete() which had far too much detailed knowledge about how the radix tree works. For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It can be used by radix_tree_delete_item() as well as radix_tree_replace_clear_tags(). 
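For illustration only, here is a minimal sketch of how a caller might use the new helper (hypothetical function name; the real conversion of page_cache_tree_delete() is in the mm/filemap.c hunk below). It assumes the caller already holds the mapping's tree lock:

	/* Swap a page-cache entry for its shadow and clear any tags on it
	 * in one call, instead of open-coding a per-tag
	 * radix_tree_tag_clear() loop.  The returned node is NULL when the
	 * entry lived directly in the root. */
	static void cache_tree_delete_sketch(struct address_space *mapping,
					     struct page *page, void *shadow)
	{
		struct radix_tree_node *node;

		node = radix_tree_replace_clear_tags(&mapping->page_tree,
						     page->index, shadow);
		if (!node)
			return;
		/* per-node accounting (page/shadow counters) would go here */
	}
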
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index bad6310..11c8e7c 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -281,9 +281,12 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, struct radix_tree_node *node); void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *); void *radix_tree_delete(struct radix_tree_root *, unsigned long); -unsigned int -radix_tree_gang_lookup(struct radix_tree_root *root, void **results, - unsigned long first_index, unsigned int max_items); +struct radix_tree_node *radix_tree_replace_clear_tags( + struct radix_tree_root *root, + unsigned long index, void *entry); +unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, + void **results, unsigned long first_index, + unsigned int max_items); unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results, unsigned long *indices, unsigned long first_index, unsigned int max_items); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 9d9b4b9..c7114d2 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -740,6 +740,26 @@ void *radix_tree_tag_set(struct radix_tree_root *root, } EXPORT_SYMBOL(radix_tree_tag_set); +static void node_tag_clear(struct radix_tree_root *root, + struct radix_tree_node *node, + unsigned int tag, unsigned int offset) +{ + while (node) { + if (!tag_get(node, tag, offset)) + return; + tag_clear(node, tag, offset); + if (any_tag_set(node, tag)) + return; + + offset = node->offset; + node = node->parent; + } + + /* clear the root's tag bit */ + if (root_tag_get(root, tag)) + root_tag_clear(root, tag); +} + /** * radix_tree_tag_clear - clear a tag on a radix tree node * @root: radix tree root @@ -776,28 +796,9 @@ void *radix_tree_tag_clear(struct radix_tree_root *root, offset = radix_tree_descend(parent, &node, offset); } - if (node == NULL) - goto out; + if (node) + node_tag_clear(root, parent, tag, offset); - index >>= shift; - - while (parent) { - if (!tag_get(parent, tag, offset)) - goto out; - tag_clear(parent, tag, offset); - if (any_tag_set(parent, tag)) - goto out; - - index >>= RADIX_TREE_MAP_SHIFT; - offset = index & RADIX_TREE_MAP_MASK; - parent = parent->parent; - } - - /* clear the root's tag bit */ - if (root_tag_get(root, tag)) - root_tag_clear(root, tag); - -out: return node; } EXPORT_SYMBOL(radix_tree_tag_clear); @@ -1525,14 +1526,9 @@ void *radix_tree_delete_item(struct radix_tree_root *root, offset = get_slot_offset(node, slot); - /* - * Clear all tags associated with the item to be deleted. - * This way of doing it would be inefficient, but seldom is any set. - */ - for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { - if (tag_get(node, tag, offset)) - radix_tree_tag_clear(root, index, tag); - } + /* Clear all tags associated with the item to be deleted. 
*/ + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) + node_tag_clear(root, node, tag, offset); delete_sibling_entries(node, node_to_entry(slot), offset); node->slots[offset] = NULL; @@ -1559,6 +1555,28 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) } EXPORT_SYMBOL(radix_tree_delete); +struct radix_tree_node *radix_tree_replace_clear_tags( + struct radix_tree_root *root, + unsigned long index, void *entry) +{ + struct radix_tree_node *node; + void **slot; + + __radix_tree_lookup(root, index, &node, &slot); + + if (node) { + unsigned int tag, offset = get_slot_offset(node, slot); + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) + node_tag_clear(root, node, tag, offset); + } else { + /* Clear root node tags */ + root->gfp_mask &= __GFP_BITS_MASK; + } + + radix_tree_replace_slot(slot, entry); + return node; +} + /** * radix_tree_tagged - test whether any items in the tree are tagged * @root: radix tree root diff --git a/mm/filemap.c b/mm/filemap.c index b418405..9665b1d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -114,14 +114,11 @@ static void page_cache_tree_delete(struct address_space *mapping, struct page *page, void *shadow) { struct radix_tree_node *node; - unsigned long index; - unsigned int offset; - unsigned int tag; - void **slot; VM_BUG_ON(!PageLocked(page)); - __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot); + node = radix_tree_replace_clear_tags(&mapping->page_tree, page->index, + shadow); if (shadow) { mapping->nrexceptional++; @@ -135,23 +132,9 @@ static void page_cache_tree_delete(struct address_space *mapping, } mapping->nrpages--; - if (!node) { - /* Clear direct pointer tags in root node */ - mapping->page_tree.gfp_mask &= __GFP_BITS_MASK; - radix_tree_replace_slot(slot, shadow); + if (!node) return; - } - - /* Clear tree tags for the removed page */ - index = page->index; - offset = index & RADIX_TREE_MAP_MASK; - for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { - if (test_bit(offset, node->tags[tag])) - radix_tree_tag_clear(&mapping->page_tree, index, tag); - } - /* Delete page, swap shadow entry */ - radix_tree_replace_slot(slot, shadow); workingset_node_pages_dec(node); if (shadow) workingset_node_shadows_inc(node); -- cgit v0.10.2 From 9e85d811196583126785a0405d0c879ae7a9eb2f Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:48 -0700 Subject: radix-tree: make radix_tree_descend() more useful Now that the shift amount is stored in the node, radix_tree_descend() can calculate offset itself from index, which removes several lines of code from each of the tree walkers. 
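As a rough before/after sketch of one walker step (illustrative only, using the same identifiers as the hunks below):

	/* Before: every walker carried 'shift' and derived the offset itself. */
	parent = entry_to_node(node);
	shift -= RADIX_TREE_MAP_SHIFT;
	offset = (index >> shift) & RADIX_TREE_MAP_MASK;
	offset = radix_tree_descend(parent, &node, offset);

	/* After: the node records its own shift, so descend() takes the index
	 * and computes the offset internally. */
	parent = entry_to_node(node);
	offset = radix_tree_descend(parent, &node, index);
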
Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Neil Brown Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/lib/radix-tree.c b/lib/radix-tree.c index c7114d2..8b7d845 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -94,9 +94,10 @@ static inline unsigned long get_slot_offset(struct radix_tree_node *parent, return slot - parent->slots; } -static unsigned radix_tree_descend(struct radix_tree_node *parent, - struct radix_tree_node **nodep, unsigned offset) +static unsigned int radix_tree_descend(struct radix_tree_node *parent, + struct radix_tree_node **nodep, unsigned long index) { + unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK; void **entry = rcu_dereference_raw(parent->slots[offset]); #ifdef CONFIG_RADIX_TREE_MULTIORDER @@ -536,8 +537,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index, /* Go a level down */ node = entry_to_node(child); - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(node, &child, offset); + offset = radix_tree_descend(node, &child, index); slot = &node->slots[offset]; } @@ -625,13 +625,12 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index, { struct radix_tree_node *node, *parent; unsigned long maxindex; - unsigned int shift; void **slot; restart: parent = NULL; slot = (void **)&root->rnode; - shift = radix_tree_load_root(root, &node, &maxindex); + radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) return NULL; @@ -641,9 +640,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index, if (node == RADIX_TREE_RETRY) goto restart; parent = entry_to_node(node); - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(parent, &node, offset); + offset = radix_tree_descend(parent, &node, index); slot = parent->slots + offset; } @@ -713,19 +710,15 @@ void *radix_tree_tag_set(struct radix_tree_root *root, { struct radix_tree_node *node, *parent; unsigned long maxindex; - unsigned int shift; - shift = radix_tree_load_root(root, &node, &maxindex); + radix_tree_load_root(root, &node, &maxindex); BUG_ON(index > maxindex); while (radix_tree_is_internal_node(node)) { unsigned offset; - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - parent = entry_to_node(node); - offset = radix_tree_descend(parent, &node, offset); + offset = radix_tree_descend(parent, &node, index); BUG_ON(!node); if (!tag_get(parent, tag, offset)) @@ -779,21 +772,17 @@ void *radix_tree_tag_clear(struct radix_tree_root *root, { struct radix_tree_node *node, *parent; unsigned long maxindex; - unsigned int shift; int uninitialized_var(offset); - shift = radix_tree_load_root(root, &node, &maxindex); + radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) return NULL; parent = NULL; while (radix_tree_is_internal_node(node)) { - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - parent = entry_to_node(node); - offset = radix_tree_descend(parent, &node, offset); + offset = radix_tree_descend(parent, &node, index); } if (node) @@ -823,25 +812,21 @@ int radix_tree_tag_get(struct radix_tree_root *root, { struct radix_tree_node *node, *parent; unsigned long maxindex; - unsigned int shift; if (!root_tag_get(root, tag)) return 0; - shift = radix_tree_load_root(root, &node, &maxindex); + radix_tree_load_root(root, &node, &maxindex); if (index > maxindex) 
return 0; if (node == NULL) return 0; while (radix_tree_is_internal_node(node)) { - int offset; - - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; + unsigned offset; parent = entry_to_node(node); - offset = radix_tree_descend(parent, &node, offset); + offset = radix_tree_descend(parent, &node, index); if (!node) return 0; @@ -874,7 +859,7 @@ static inline void __set_iter_shift(struct radix_tree_iter *iter, void **radix_tree_next_chunk(struct radix_tree_root *root, struct radix_tree_iter *iter, unsigned flags) { - unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK; + unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK; struct radix_tree_node *node, *child; unsigned long index, offset, maxindex; @@ -895,7 +880,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, return NULL; restart: - shift = radix_tree_load_root(root, &child, &maxindex); + radix_tree_load_root(root, &child, &maxindex); if (index > maxindex) return NULL; if (!child) @@ -912,9 +897,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, do { node = entry_to_node(child); - shift -= RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(node, &child, offset); + offset = radix_tree_descend(node, &child, index); if ((flags & RADIX_TREE_ITER_TAGGED) ? !tag_get(node, tag, offset) : !child) { @@ -936,7 +919,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, break; } index &= ~node_maxindex(node); - index += offset << shift; + index += offset << node->shift; /* Overflow after ~0UL */ if (!index) return NULL; @@ -952,7 +935,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root, /* Update the iterator state */ iter->index = (index &~ node_maxindex(node)) | (offset << node->shift); iter->next_index = (index | node_maxindex(node)) + 1; - __set_iter_shift(iter, shift); + __set_iter_shift(iter, node->shift); /* Construct iter->tags bit-mask from node->tags[tag] array */ if (flags & RADIX_TREE_ITER_TAGGED) { @@ -1010,10 +993,10 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, { struct radix_tree_node *parent, *node, *child; unsigned long maxindex; - unsigned int shift = radix_tree_load_root(root, &child, &maxindex); unsigned long tagged = 0; unsigned long index = *first_indexp; + radix_tree_load_root(root, &child, &maxindex); last_index = min(last_index, maxindex); if (index > last_index) return 0; @@ -1030,11 +1013,9 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, } node = entry_to_node(child); - shift -= RADIX_TREE_MAP_SHIFT; for (;;) { - unsigned offset = (index >> shift) & RADIX_TREE_MAP_MASK; - offset = radix_tree_descend(node, &child, offset); + unsigned offset = radix_tree_descend(node, &child, index); if (!child) goto next; if (!tag_get(node, iftag, offset)) @@ -1042,7 +1023,6 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, /* Sibling slots never have tags set on them */ if (radix_tree_is_internal_node(child)) { node = entry_to_node(child); - shift -= RADIX_TREE_MAP_SHIFT; continue; } @@ -1063,12 +1043,12 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, tag_set(parent, settag, offset); } next: - /* Go to next item at level determined by 'shift' */ - index = ((index >> shift) + 1) << shift; + /* Go to next entry in node */ + index = ((index >> node->shift) + 1) << node->shift; /* Overflow can happen when last_index is ~0UL... 
*/ if (index > last_index || !index) break; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = (index >> node->shift) & RADIX_TREE_MAP_MASK; while (offset == 0) { /* * We've fully scanned this node. Go up. Because @@ -1076,8 +1056,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, * we do below cannot wander astray. */ node = node->parent; - shift += RADIX_TREE_MAP_SHIFT; - offset = (index >> shift) & RADIX_TREE_MAP_MASK; + offset = (index >> node->shift) & RADIX_TREE_MAP_MASK; } if (is_sibling_entry(node, node->slots[offset])) goto next; @@ -1275,13 +1254,10 @@ struct locate_info { static unsigned long __locate(struct radix_tree_node *slot, void *item, unsigned long index, struct locate_info *info) { - unsigned int shift; unsigned long i; - shift = slot->shift + RADIX_TREE_MAP_SHIFT; - do { - shift -= RADIX_TREE_MAP_SHIFT; + unsigned int shift = slot->shift; for (i = (index >> shift) & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; @@ -1304,9 +1280,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item, slot = node; break; } - if (i == RADIX_TREE_MAP_SIZE) - break; - } while (shift); + } while (i < RADIX_TREE_MAP_SIZE); out: if ((index == 0) && (i == RADIX_TREE_MAP_SIZE)) -- cgit v0.10.2 From 78a9be0a0a3367b94af242632c525d22b26f1a87 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 20 May 2016 17:03:51 -0700 Subject: dax: move RADIX_DAX_ definitions to dax.c These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do. Let's try to maintain the idea that radix-tree simply implements an abstract data type. Signed-off-by: NeilBrown Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara Signed-off-by: Matthew Wilcox Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/dax.c b/fs/dax.c index 0dbe4e0..a345c16 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -32,6 +32,15 @@ #include #include +#define RADIX_DAX_MASK 0xf +#define RADIX_DAX_SHIFT 4 +#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK) +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT)) +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \ + RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE))) + static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax) { struct request_queue *q = bdev->bd_queue; diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index 11c8e7c..c2f69e2 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -48,15 +48,6 @@ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 -#define RADIX_DAX_MASK 0xf -#define RADIX_DAX_SHIFT 4 -#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY) -#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY) -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK) -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT)) -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \ - RADIX_DAX_SHIFT | (pmd ? 
RADIX_DAX_PMD : RADIX_DAX_PTE))) - static inline int radix_tree_is_internal_node(void *ptr) { return (int)((unsigned long)ptr & RADIX_TREE_INTERNAL_NODE); -- cgit v0.10.2 From 3bcadd6fa6c4fd07ace3626357c824eb532488a6 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 20 May 2016 17:03:54 -0700 Subject: radix-tree: free up the bottom bit of exceptional entries for reuse We are guaranteed that pointers to radix_tree_nodes always have the bottom two bits clear (because they come from a slab cache, and slab caches have a minimum alignment of sizeof(void *)), so we can redefine 'radix_tree_is_internal_node' to only return true if the bottom two bits have value '01'. This frees up one quarter of the potential values for use by the user. Idea from Neil Brown. Signed-off-by: Matthew Wilcox Suggested-by: Neil Brown Cc: Konstantin Khlebnikov Cc: Kirill Shutemov Cc: Jan Kara Cc: Ross Zwisler Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h index c2f69e2..cb4b7e8 100644 --- a/include/linux/radix-tree.h +++ b/include/linux/radix-tree.h @@ -29,28 +29,37 @@ #include /* - * Entries in the radix tree have the low bit set if they refer to a - * radix_tree_node. If the low bit is clear then the entry is user data. - * - * We also use the low bit to indicate that the slot will be freed in the - * next RCU idle period, and users need to re-walk the tree to find the - * new slot for the index that they were looking for. See the comment in - * radix_tree_shrink() for details. + * The bottom two bits of the slot determine how the remaining bits in the + * slot are interpreted: + * + * 00 - data pointer + * 01 - internal entry + * 10 - exceptional entry + * 11 - locked exceptional entry + * + * The internal entry may be a pointer to the next level in the tree, a + * sibling entry, or an indicator that the entry in this slot has been moved + * to another location in the tree and the lookup should be restarted. While + * NULL fits the 'data pointer' pattern, it means that there is no entry in + * the tree for this index (no matter what level of the tree it is found at). + * This means that you cannot store NULL in the tree as a value for the index. */ -#define RADIX_TREE_INTERNAL_NODE 1 +#define RADIX_TREE_ENTRY_MASK 3UL +#define RADIX_TREE_INTERNAL_NODE 1UL /* - * A common use of the radix tree is to store pointers to struct pages; - * but shmem/tmpfs needs also to store swap entries in the same tree: - * those are marked as exceptional entries to distinguish them. + * Most users of the radix tree store pointers but shmem/tmpfs stores swap + * entries in the same tree. They are marked as exceptional entries to + * distinguish them from pointers to struct page. * EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it. 
*/ #define RADIX_TREE_EXCEPTIONAL_ENTRY 2 #define RADIX_TREE_EXCEPTIONAL_SHIFT 2 -static inline int radix_tree_is_internal_node(void *ptr) +static inline bool radix_tree_is_internal_node(void *ptr) { - return (int)((unsigned long)ptr & RADIX_TREE_INTERNAL_NODE); + return ((unsigned long)ptr & RADIX_TREE_ENTRY_MASK) == + RADIX_TREE_INTERNAL_NODE; } /*** radix-tree API starts here ***/ @@ -236,8 +245,7 @@ static inline int radix_tree_exceptional_entry(void *arg) */ static inline int radix_tree_exception(void *arg) { - return unlikely((unsigned long)arg & - (RADIX_TREE_INTERNAL_NODE | RADIX_TREE_EXCEPTIONAL_ENTRY)); + return unlikely((unsigned long)arg & RADIX_TREE_ENTRY_MASK); } /** -- cgit v0.10.2 From fff7fb0b2d908dec779783d8eaf3d7725230f75e Mon Sep 17 00:00:00 2001 From: Zhaoxiu Zeng Date: Fri, 20 May 2016 17:03:57 -0700 Subject: lib/GCD.c: use binary GCD algorithm instead of Euclidean The binary GCD algorithm is based on the following facts: 1. If a and b are all evens, then gcd(a,b) = 2 * gcd(a/2, b/2) 2. If a is even and b is odd, then gcd(a,b) = gcd(a/2, b) 3. If a and b are all odds, then gcd(a,b) = gcd((a-b)/2, b) = gcd((a+b)/2, b) Even on x86 machines with reasonable division hardware, the binary algorithm runs about 25% faster (80% the execution time) than the division-based Euclidian algorithm. On platforms like Alpha and ARMv6 where division is a function call to emulation code, it's even more significant. There are two variants of the code here, depending on whether a fast __ffs (find least significant set bit) instruction is available. This allows the unpredictable branches in the bit-at-a-time shifting loop to be eliminated. If fast __ffs is not available, the "even/odd" GCD variant is used. I use the following code to benchmark: #include #include #include #include #include #include #define swap(a, b) \ do { \ a ^= b; \ b ^= a; \ a ^= b; \ } while (0) unsigned long gcd0(unsigned long a, unsigned long b) { unsigned long r; if (a < b) { swap(a, b); } if (b == 0) return a; while ((r = a % b) != 0) { a = b; b = r; } return b; } unsigned long gcd1(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; b >>= __builtin_ctzl(b); for (;;) { a >>= __builtin_ctzl(a); if (a == b) return a << __builtin_ctzl(r); if (a < b) swap(a, b); a -= b; } } unsigned long gcd2(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; r &= -r; while (!(b & r)) b >>= 1; for (;;) { while (!(a & r)) a >>= 1; if (a == b) return a; if (a < b) swap(a, b); a -= b; a >>= 1; if (a & r) a += b; a >>= 1; } } unsigned long gcd3(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; b >>= __builtin_ctzl(b); if (b == 1) return r & -r; for (;;) { a >>= __builtin_ctzl(a); if (a == 1) return r & -r; if (a == b) return a << __builtin_ctzl(r); if (a < b) swap(a, b); a -= b; } } unsigned long gcd4(unsigned long a, unsigned long b) { unsigned long r = a | b; if (!a || !b) return r; r &= -r; while (!(b & r)) b >>= 1; if (b == r) return r; for (;;) { while (!(a & r)) a >>= 1; if (a == r) return r; if (a == b) return a; if (a < b) swap(a, b); a -= b; a >>= 1; if (a & r) a += b; a >>= 1; } } static unsigned long (*gcd_func[])(unsigned long a, unsigned long b) = { gcd0, gcd1, gcd2, gcd3, gcd4, }; #define TEST_ENTRIES (sizeof(gcd_func) / sizeof(gcd_func[0])) #if defined(__x86_64__) #define rdtscll(val) do { \ unsigned long __a,__d; \ __asm__ __volatile__("rdtsc" : "=a" (__a), "=d" (__d)); \ (val) = ((unsigned long long)__a) | (((unsigned long 
long)__d)<<32); \ } while(0) static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long), unsigned long a, unsigned long b, unsigned long *res) { unsigned long long start, end; unsigned long long ret; unsigned long gcd_res; rdtscll(start); gcd_res = gcd(a, b); rdtscll(end); if (end >= start) ret = end - start; else ret = ~0ULL - start + 1 + end; *res = gcd_res; return ret; } #else static inline struct timespec read_time(void) { struct timespec time; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time); return time; } static inline unsigned long long diff_time(struct timespec start, struct timespec end) { struct timespec temp; if ((end.tv_nsec - start.tv_nsec) < 0) { temp.tv_sec = end.tv_sec - start.tv_sec - 1; temp.tv_nsec = 1000000000ULL + end.tv_nsec - start.tv_nsec; } else { temp.tv_sec = end.tv_sec - start.tv_sec; temp.tv_nsec = end.tv_nsec - start.tv_nsec; } return temp.tv_sec * 1000000000ULL + temp.tv_nsec; } static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long), unsigned long a, unsigned long b, unsigned long *res) { struct timespec start, end; unsigned long gcd_res; start = read_time(); gcd_res = gcd(a, b); end = read_time(); *res = gcd_res; return diff_time(start, end); } #endif static inline unsigned long get_rand() { if (sizeof(long) == 8) return (unsigned long)rand() << 32 | rand(); else return rand(); } int main(int argc, char **argv) { unsigned int seed = time(0); int loops = 100; int repeats = 1000; unsigned long (*res)[TEST_ENTRIES]; unsigned long long elapsed[TEST_ENTRIES]; int i, j, k; for (;;) { int opt = getopt(argc, argv, "n:r:s:"); /* End condition always first */ if (opt == -1) break; switch (opt) { case 'n': loops = atoi(optarg); break; case 'r': repeats = atoi(optarg); break; case 's': seed = strtoul(optarg, NULL, 10); break; default: /* You won't actually get here. */ break; } } res = malloc(sizeof(unsigned long) * TEST_ENTRIES * loops); memset(elapsed, 0, sizeof(elapsed)); srand(seed); for (j = 0; j < loops; j++) { unsigned long a = get_rand(); /* Do we have args? */ unsigned long b = argc > optind ? strtoul(argv[optind], NULL, 10) : get_rand(); unsigned long long min_elapsed[TEST_ENTRIES]; for (k = 0; k < repeats; k++) { for (i = 0; i < TEST_ENTRIES; i++) { unsigned long long tmp = benchmark_gcd_func(gcd_func[i], a, b, &res[j][i]); if (k == 0 || min_elapsed[i] > tmp) min_elapsed[i] = tmp; } } for (i = 0; i < TEST_ENTRIES; i++) elapsed[i] += min_elapsed[i]; } for (i = 0; i < TEST_ENTRIES; i++) printf("gcd%d: elapsed %llu\n", i, elapsed[i]); k = 0; srand(seed); for (j = 0; j < loops; j++) { unsigned long a = get_rand(); unsigned long b = argc > optind ? strtoul(argv[optind], NULL, 10) : get_rand(); for (i = 1; i < TEST_ENTRIES; i++) { if (res[j][i] != res[j][0]) break; } if (i < TEST_ENTRIES) { if (k == 0) { k = 1; fprintf(stderr, "Error:\n"); } fprintf(stderr, "gcd(%lu, %lu): ", a, b); for (i = 0; i < TEST_ENTRIES; i++) fprintf(stderr, "%ld%s", res[j][i], i < TEST_ENTRIES - 1 ? 
", " : "\n"); } } if (k == 0) fprintf(stderr, "PASS\n"); free(res); return 0; } Compiled with "-O2", on "VirtualBox 4.4.0-22-generic #38-Ubuntu x86_64" got: zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10 gcd0: elapsed 10174 gcd1: elapsed 2120 gcd2: elapsed 2902 gcd3: elapsed 2039 gcd4: elapsed 2812 PASS zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10 gcd0: elapsed 9309 gcd1: elapsed 2280 gcd2: elapsed 2822 gcd3: elapsed 2217 gcd4: elapsed 2710 PASS zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10 gcd0: elapsed 9589 gcd1: elapsed 2098 gcd2: elapsed 2815 gcd3: elapsed 2030 gcd4: elapsed 2718 PASS zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10 gcd0: elapsed 9914 gcd1: elapsed 2309 gcd2: elapsed 2779 gcd3: elapsed 2228 gcd4: elapsed 2709 PASS [akpm@linux-foundation.org: avoid #defining a CONFIG_ variable] Signed-off-by: Zhaoxiu Zeng Signed-off-by: George Spelvin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/arch/Kconfig b/arch/Kconfig index 8f84fd2..b16e74e 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -647,4 +647,7 @@ config COMPAT_OLD_SIGACTION config ARCH_NO_COHERENT_DMA_MMAP bool +config CPU_NO_EFFICIENT_FFS + def_bool n + source "kernel/gcov/Kconfig" diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index fe99f89..7f312d8 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -26,6 +26,7 @@ config ALPHA select MODULES_USE_ELF_RELA select ODD_RT_SIGACTION select OLD_SIGSUSPEND + select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67 help The Alpha is a 64-bit general-purpose processor designed and marketed by the Digital Equipment Corporation of blessed memory, diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 8894f7e..0dcbacf 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -107,6 +107,7 @@ choice config ISA_ARCOMPACT bool "ARCompact ISA" + select CPU_NO_EFFICIENT_FFS help The original ARC ISA of ARC600/700 cores diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index 5534766..cb569b6 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -421,18 +421,21 @@ config CPU_32v3 select CPU_USE_DOMAINS if MMU select NEED_KUSER_HELPERS select TLS_REG_EMUL if SMP || !MMU + select CPU_NO_EFFICIENT_FFS config CPU_32v4 bool select CPU_USE_DOMAINS if MMU select NEED_KUSER_HELPERS select TLS_REG_EMUL if SMP || !MMU + select CPU_NO_EFFICIENT_FFS config CPU_32v4T bool select CPU_USE_DOMAINS if MMU select NEED_KUSER_HELPERS select TLS_REG_EMUL if SMP || !MMU + select CPU_NO_EFFICIENT_FFS config CPU_32v5 bool diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig index 986ea84..aa232de 100644 --- a/arch/h8300/Kconfig +++ b/arch/h8300/Kconfig @@ -20,6 +20,7 @@ config H8300 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZO select HAVE_ARCH_KGDB + select CPU_NO_EFFICIENT_FFS config RWSEM_GENERIC_SPINLOCK def_bool y diff --git a/arch/m32r/Kconfig b/arch/m32r/Kconfig index c82b292..3cc8498 100644 --- a/arch/m32r/Kconfig +++ b/arch/m32r/Kconfig @@ -17,6 +17,7 @@ config M32R select ARCH_USES_GETTIMEOFFSET select MODULES_USE_ELF_RELA select HAVE_DEBUG_STACKOVERFLOW + select CPU_NO_EFFICIENT_FFS config SBUS bool diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu index c1beb5a..8ace920 100644 --- a/arch/m68k/Kconfig.cpu +++ b/arch/m68k/Kconfig.cpu @@ -40,6 +40,7 @@ config M68000 select CPU_HAS_NO_MULDIV64 select CPU_HAS_NO_UNALIGNED select GENERIC_CSUM + select CPU_NO_EFFICIENT_FFS help The Freescale (was Motorola) 68000 CPU is the first generation of the well known M68K family of 
processors. The CPU core as well as @@ -51,6 +52,7 @@ config MCPU32 bool select CPU_HAS_NO_BITFIELDS select CPU_HAS_NO_UNALIGNED + select CPU_NO_EFFICIENT_FFS help The Freescale (was then Motorola) CPU32 is a CPU core that is based on the 68020 processor. For the most part it is used in @@ -130,6 +132,7 @@ config M5206 depends on !MMU select COLDFIRE_SW_A7 select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5206 processor support. @@ -138,6 +141,7 @@ config M5206e depends on !MMU select COLDFIRE_SW_A7 select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5206e processor support. @@ -163,6 +167,7 @@ config M5249 depends on !MMU select COLDFIRE_SW_A7 select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5249 processor support. @@ -171,6 +176,7 @@ config M525x depends on !MMU select COLDFIRE_SW_A7 select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Freescale (Motorola) Coldfire 5251/5253 processor support. @@ -189,6 +195,7 @@ config M5272 depends on !MMU select COLDFIRE_SW_A7 select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5272 processor support. @@ -217,6 +224,7 @@ config M5307 select COLDFIRE_SW_A7 select HAVE_CACHE_CB select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5307 processor support. @@ -242,6 +250,7 @@ config M5407 select COLDFIRE_SW_A7 select HAVE_CACHE_CB select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Motorola ColdFire 5407 processor support. @@ -251,6 +260,7 @@ config M547x select MMU_COLDFIRE if MMU select HAVE_CACHE_CB select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Freescale ColdFire 5470/5471/5472/5473/5474/5475 processor support. @@ -260,6 +270,7 @@ config M548x select M54xx select HAVE_CACHE_CB select HAVE_MBAR + select CPU_NO_EFFICIENT_FFS help Freescale ColdFire 5480/5481/5482/5483/5484/5485 processor support. 
diff --git a/arch/metag/Kconfig b/arch/metag/Kconfig index e47a08d..5b7a45d 100644 --- a/arch/metag/Kconfig +++ b/arch/metag/Kconfig @@ -30,6 +30,7 @@ config METAG select OF select OF_EARLY_FLATTREE select SPARSE_IRQ + select CPU_NO_EFFICIENT_FFS config STACKTRACE_SUPPORT def_bool y diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index 3d793b5..f17c3a4 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -32,6 +32,7 @@ config MICROBLAZE select OF_EARLY_FLATTREE select TRACING_SUPPORT select VIRT_TO_BUS + select CPU_NO_EFFICIENT_FFS config SWAP def_bool n diff --git a/arch/mips/include/asm/cpu-features.h b/arch/mips/include/asm/cpu-features.h index e6f19fc..e961c8a 100644 --- a/arch/mips/include/asm/cpu-features.h +++ b/arch/mips/include/asm/cpu-features.h @@ -204,6 +204,16 @@ #endif #endif +/* __builtin_constant_p(cpu_has_mips_r) && cpu_has_mips_r */ +#if !((defined(cpu_has_mips32r1) && cpu_has_mips32r1) || \ + (defined(cpu_has_mips32r2) && cpu_has_mips32r2) || \ + (defined(cpu_has_mips32r6) && cpu_has_mips32r6) || \ + (defined(cpu_has_mips64r1) && cpu_has_mips64r1) || \ + (defined(cpu_has_mips64r2) && cpu_has_mips64r2) || \ + (defined(cpu_has_mips64r6) && cpu_has_mips64r6)) +#define CPU_NO_EFFICIENT_FFS 1 +#endif + #ifndef cpu_has_mips_1 # define cpu_has_mips_1 (!cpu_has_mips_r6) #endif diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig index 87ca653..51a56c8 100644 --- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -15,6 +15,7 @@ config NIOS2 select SOC_BUS select SPARSE_IRQ select USB_ARCH_HAS_HCD if USB_SUPPORT + select CPU_NO_EFFICIENT_FFS config GENERIC_CSUM def_bool y diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig index e118c02..142cb05 100644 --- a/arch/openrisc/Kconfig +++ b/arch/openrisc/Kconfig @@ -25,6 +25,7 @@ config OPENRISC select MODULES_USE_ELF_RELA select HAVE_DEBUG_STACKOVERFLOW select OR1K_PIC + select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1 config MMU def_bool y diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 88cfaa8..3d498a6 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -32,6 +32,7 @@ config PARISC select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_SECCOMP_FILTER select ARCH_NO_COHERENT_DMA_MMAP + select CPU_NO_EFFICIENT_FFS help The PA-RISC microprocessor is designed by Hewlett-Packard and used diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 1c3c43d..a8c2590 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -123,6 +123,7 @@ config S390 select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_EARLY_PFN_TO_NID select HAVE_ARCH_JUMP_LABEL + select CPU_NO_EFFICIENT_FFS if !HAVE_MARCH_Z9_109_FEATURES select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_SOFT_DIRTY select HAVE_ARCH_TRACEHOOK diff --git a/arch/score/Kconfig b/arch/score/Kconfig index 366e1b5..507d631 100644 --- a/arch/score/Kconfig +++ b/arch/score/Kconfig @@ -14,6 +14,7 @@ config SCORE select VIRT_TO_BUS select MODULES_USE_ELF_REL select CLONE_BACKWARDS + select CPU_NO_EFFICIENT_FFS choice prompt "System type" diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig index f625434..e803a83 100644 --- a/arch/sh/Kconfig +++ b/arch/sh/Kconfig @@ -20,6 +20,7 @@ config SUPERH select PERF_USE_VMALLOC select HAVE_DEBUG_KMEMLEAK select HAVE_KERNEL_GZIP + select CPU_NO_EFFICIENT_FFS select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_LZMA select HAVE_KERNEL_XZ diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 1012f7f..546293d 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -42,6 +42,7 @@ config SPARC select ODD_RT_SIGACTION select 
OLD_SIGSUSPEND select ARCH_HAS_SG_CHAIN + select CPU_NO_EFFICIENT_FFS config SPARC32 def_bool !64BIT diff --git a/lib/gcd.c b/lib/gcd.c index 3657f12..135ee64 100644 --- a/lib/gcd.c +++ b/lib/gcd.c @@ -2,20 +2,77 @@ #include #include -/* Greatest common divisor */ +/* + * This implements the binary GCD algorithm. (Often attributed to Stein, + * but as Knuth has noted, appears in a first-century Chinese math text.) + * + * This is faster than the division-based algorithm even on x86, which + * has decent hardware division. + */ + +#if !defined(CONFIG_CPU_NO_EFFICIENT_FFS) && !defined(CPU_NO_EFFICIENT_FFS) + +/* If __ffs is available, the even/odd algorithm benchmarks slower. */ unsigned long gcd(unsigned long a, unsigned long b) { - unsigned long r; + unsigned long r = a | b; + + if (!a || !b) + return r; - if (a < b) - swap(a, b); + b >>= __ffs(b); + if (b == 1) + return r & -r; - if (!b) - return a; - while ((r = a % b) != 0) { - a = b; - b = r; + for (;;) { + a >>= __ffs(a); + if (a == 1) + return r & -r; + if (a == b) + return a << __ffs(r); + + if (a < b) + swap(a, b); + a -= b; } - return b; } + +#else + +/* If normalization is done by loops, the even/odd algorithm is a win. */ +unsigned long gcd(unsigned long a, unsigned long b) +{ + unsigned long r = a | b; + + if (!a || !b) + return r; + + /* Isolate lsbit of r */ + r &= -r; + + while (!(b & r)) + b >>= 1; + if (b == r) + return r; + + for (;;) { + while (!(a & r)) + a >>= 1; + if (a == r) + return r; + if (a == b) + return a; + + if (a < b) + swap(a, b); + a -= b; + a >>= 1; + if (a & r) + a += b; + a >>= 1; + } +} + +#endif + EXPORT_SYMBOL_GPL(gcd); -- cgit v0.10.2 From 2d6327459ec5e63a89b12945f483f6d7378a8839 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:00 -0700 Subject: checkpatch: add PREFER_IS_ENABLED test Using #if defined CONFIG_ || defined CONFIG__MODULE is more verbose than necessary and IS_ENABLED(CONFIG_) is preferred. So add a test and a message for it. --fix it to if desired. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index d574d13..eb8f887 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -5637,6 +5637,16 @@ sub process { } } +# check for #if defined CONFIG_ || defined CONFIG__MODULE + if ($line =~ /^\+\s*#\s*if\s+defined(?:\s*\(?\s*|\s+)(CONFIG_[A-Z_]+)\s*\)?\s*\|\|\s*defined(?:\s*\(?\s*|\s+)\1_MODULE\s*\)?\s*$/) { + my $config = $1; + if (WARN("PREFER_IS_ENABLED", + "Prefer IS_ENABLED() to CONFIG_ || CONFIG__MODULE\n" . $herecurr) && + $fix) { + $fixed[$fixlinenr] = "\+#if IS_ENABLED($config)"; + } + } + # check for case / default statements not preceded by break/fallthrough/switch if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) { my $has_break = 0; -- cgit v0.10.2 From f39e1769bbfc4936ff8364fb2529dc8bf6bc6888 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:02 -0700 Subject: checkpatch: improve CONSTANT_COMPARISON test for structure members A "." dereference to an all uppercase structure member can be incorrectly reported as a CONSTANT_COMPARISON. ie: "if (table[i].PANELID == tempdx)" Fix it by checking for "." before the constant test. 
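For example, code along these lines (a hypothetical reduction of the case quoted above) used to be flagged even though the left-hand side is a structure member, not a constant:

	struct panel_desc { int PANELID; };

	static int find_panel(const struct panel_desc *table, int n, int tempdx)
	{
		int i;

		for (i = 0; i < n; i++) {
			if (table[i].PANELID == tempdx)	/* previously misreported */
				return i;
		}
		return -1;
	}
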
Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index eb8f887..e3d9c34 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -4291,7 +4291,7 @@ sub process { my $comp = $3; my $to = $4; my $newcomp = $comp; - if ($lead !~ /$Operators\s*$/ && + if ($lead !~ /(?:$Operators|\.)\s*$/ && $to !~ /^(?:Constant|[A-Z_][A-Z0-9_]*)$/ && WARN("CONSTANT_COMPARISON", "Comparisons should place the constant on the right side of the test\n" . $herecurr) && -- cgit v0.10.2 From a91e8994f242a500bf05b9ee96fcd7ab0228f05f Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:05 -0700 Subject: checkpatch: add test for keywords not starting on tabstops It's somewhat common and in general a defect for c90 keywords to not start on a tabstop. Add a test for this condition and warn when it occurs. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index e3d9c34..6c1213c 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2755,6 +2755,19 @@ sub process { "Logical continuations should be on the previous line\n" . $hereprev); } +# check indentation starts on a tab stop + if ($^V && $^V ge 5.10.0 && + $sline =~ /^\+\t+( +)(?:$c90_Keywords\b|\{\s*$|\}\s*(?:else\b|while\b|\s*$))/) { + my $indent = length($1); + if ($indent % 8) { + if (WARN("TABSTOP", + "Statements should start on a tabstop\n" . $herecurr) && + $fix) { + $fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/8)@e; + } + } + } + # check multi-line statement indentation matches previous line if ($^V && $^V ge 5.10.0 && $prevline =~ /^\+([ \t]*)((?:$c90_Keywords(?:\s+if)\s*)|(?:$Declare\s*)?(?:$Ident|\(\s*\*\s*$Ident\s*\))\s*|$Ident\s*=\s*$Ident\s*)\(.*(\&\&|\|\||,)\s*$/) { -- cgit v0.10.2 From 481aea5c59a57123b66d5850be1be79f9f230c0e Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:08 -0700 Subject: checkpatch: whine about ACCESS_ONCE Add a test for use of ACCESS_ONCE that could be written using READ_ONCE or WRITE_ONCE. --fix it too if desired. The WRITE_ONCE fixes are less correct than the coccinelle script below as checkpatch cannot have a completely correct "expression" mechanism because checkpatch works on patches and not complete files. $ cat access_once.cocci @@ expression e1; expression e2; @@ - ACCESS_ONCE(e1) = e2 + WRITE_ONCE(e1, e2) @@ expression e1; @@ - ACCESS_ONCE(e1) + READ_ONCE(e1) Signed-off-by: Joe Perches Cc: Julia Lawall Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 6c1213c..1048672 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -5850,6 +5850,28 @@ sub process { } } +# whine about ACCESS_ONCE + if ($^V && $^V ge 5.10.0 && + $line =~ /\bACCESS_ONCE\s*$balanced_parens\s*(=(?!=))?\s*($FuncArg)?/) { + my $par = $1; + my $eq = $2; + my $fun = $3; + $par =~ s/^\(\s*(.*)\s*\)$/$1/; + if (defined($eq)) { + if (WARN("PREFER_WRITE_ONCE", + "Prefer WRITE_ONCE(, ) over ACCESS_ONCE() = \n" . $herecurr) && + $fix) { + $fixed[$fixlinenr] =~ s/\bACCESS_ONCE\s*\(\s*\Q$par\E\s*\)\s*$eq\s*\Q$fun\E/WRITE_ONCE($par, $fun)/; + } + } else { + if (WARN("PREFER_READ_ONCE", + "Prefer READ_ONCE() over ACCESS_ONCE()\n" . 
$herecurr) && + $fix) { + $fixed[$fixlinenr] =~ s/\bACCESS_ONCE\s*\(\s*\Q$par\E\s*\)/READ_ONCE($par)/; + } + } + } + # check for lockdep_set_novalidate_class if ($line =~ /^.\s*lockdep_set_novalidate_class\s*\(/ || $line =~ /__lockdep_no_validate__\s*\)/ ) { -- cgit v0.10.2 From ef212196369cbc2e694eab39261f9785ec252028 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:11 -0700 Subject: checkpatch: advertise the --fix and --fix-inplace options more The --fix option is relatively unknown and underutilized. Add some text to show that it's available when style defects are found. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 1048672..f09c3f2 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -5975,6 +5975,14 @@ sub process { } if ($quiet == 0) { + # If there were any defects found and not already fixing them + if (!$clean and !$fix) { + print << "EOM" + +NOTE: For some of the reported defects, checkpatch may be able to + mechanically convert to the typical style using --fix or --fix-inplace. +EOM + } # If there were whitespace errors which cleanpatch can fix # then suggest that. if ($rpt_cleaners) { -- cgit v0.10.2 From 3beb42eced39c00011ba4d608d52718af765e5d4 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:14 -0700 Subject: checkpatch: add --list-types to show message types to show or ignore The message types are not currently knowable without reading the code. Add a mechanism to see what they are. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index f09c3f2..c5a3c95 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -33,6 +33,7 @@ my $summary = 1; my $mailback = 0; my $summary_file = 0; my $show_types = 0; +my $list_types = 0; my $fix = 0; my $fix_inplace = 0; my $root; @@ -70,11 +71,12 @@ Options: --showfile emit diffed file position, not input file position -f, --file treat FILE as regular source file --subjective, --strict enable more subjective tests + --list-types list the possible message types --types TYPE(,TYPE2...) show only these comma separated message types --ignore TYPE(,TYPE2...) ignore various comma separated message types + --show-types show the specific message type in the output --max-line-length=n set the maximum line length, if exceeded, warn --min-conf-desc-length=n set the min description length, if shorter, warn - --show-types show the message "types" in the output --root=PATH PATH to the kernel tree root --no-summary suppress the per-file summary --mailback only produce a report in case of warnings/errors @@ -106,6 +108,37 @@ EOM exit($exitcode); } +sub uniq { + my %seen; + return grep { !$seen{$_}++ } @_; +} + +sub list_types { + my ($exitcode) = @_; + + my $count = 0; + + local $/ = undef; + + open(my $script, '<', abs_path($P)) or + die "$P: Can't read '$P' $!\n"; + + my $text = <$script>; + close($script); + + my @types = (); + for ($text =~ /\b(?:(?:CHK|WARN|ERROR)\s*\(\s*"([^"]+)")/g) { + push (@types, $_); + } + @types = sort(uniq(@types)); + print("#\tMessage type\n\n"); + foreach my $type (@types) { + print(++$count . "\t" . $type . "\n"); + } + + exit($exitcode); +} + my $conf = which_conf($configuration_file); if (-f $conf) { my @conf_args; @@ -146,6 +179,7 @@ GetOptions( 'ignore=s' => \@ignore, 'types=s' => \@use, 'show-types!' => \$show_types, + 'list-types!' 
=> \$list_types, 'max-line-length=i' => \$max_line_length, 'min-conf-desc-length=i' => \$min_conf_desc_length, 'root=s' => \$root, @@ -166,6 +200,8 @@ GetOptions( help(0) if ($help); +list_types(0) if ($list_types); + $fix = 1 if ($fix_inplace); $check_orig = $check; -- cgit v0.10.2 From 4a593c3448312906358b00898c29a95278d82cc9 Mon Sep 17 00:00:00 2001 From: "Du, Changbin" Date: Fri, 20 May 2016 17:04:16 -0700 Subject: checkpatch: add support to check already applied git commits It's sometimes useful to scan already committed patches. Add --git to scan specific or multiple commits. Single commits are scanned with --git Multiple commits are scanned with --git --git - [joe@perches.com: o Don't exec git for each -, use a single "git log - " o Consolidate the git exec for the and - variants o Output 12 character commit hash ids o Don't scan git commit merges o Use -M to reduce the size of rename commits] Signed-off-by: "Du, Changbin" Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index c5a3c95..8fc9edd 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -27,6 +27,7 @@ my $emacs = 0; my $terse = 0; my $showfile = 0; my $file = 0; +my $git = 0; my $check = 0; my $check_orig = 0; my $summary = 1; @@ -69,6 +70,16 @@ Options: --emacs emacs compile window format --terse one line per report --showfile emit diffed file position, not input file position + -g, --git treat FILE as a single commit or git revision range + single git commit with: + + ^ + ~n + multiple git commits with: + .. + ... + - + git merges are ignored -f, --file treat FILE as regular source file --subjective, --strict enable more subjective tests --list-types list the possible message types @@ -174,6 +185,7 @@ GetOptions( 'terse!' => \$terse, 'showfile!' => \$showfile, 'f|file!' => \$file, + 'g|git!' => \$git, 'subjective!' => \$check, 'strict!' => \$check, 'ignore=s' => \@ignore, @@ -788,10 +800,42 @@ my @fixed_inserted = (); my @fixed_deleted = (); my $fixlinenr = -1; +# If input is git commits, extract all commits from the commit expressions. +# For example, HEAD-3 means we need check 'HEAD, HEAD~1, HEAD~2'. +die "$P: No git repository found\n" if ($git && !-e ".git"); + +if ($git) { + my @commits = (); + for my $commit_expr (@ARGV) { + my $git_range; + if ($commit_expr =~ m/-/) { + my @tmp = split(/-/, $commit_expr); + die "$P: incorrect git commits expression $commit_expr$!\n" + if (@tmp != 2); + $git_range = "-$tmp[1] $tmp[0]"; + } elsif ($commit_expr =~ m/\.\./) { + $git_range = "$commit_expr"; + } + if (defined $git_range) { + my $lines = `git log --no-merges --pretty=format:'%H' $git_range`; + foreach my $line (split(/\n/, $lines)) { + unshift(@commits, $line); + } + } else { + unshift(@commits, $commit_expr); + } + } + die "$P: no git commits after extraction!\n" if (@commits == 0); + @ARGV = @commits; +} + my $vname; for my $filename (@ARGV) { my $FILE; - if ($file) { + if ($git) { + open($FILE, '-|', "git format-patch -M --stdout -1 $filename") || + die "$P: $filename: git format-patch failed - $!\n"; + } elsif ($file) { open($FILE, '-|', "diff -u /dev/null $filename") || die "$P: $filename: diff failed - $!\n"; } elsif ($filename eq '-') { @@ -802,6 +846,8 @@ for my $filename (@ARGV) { } if ($filename eq '-') { $vname = 'Your patch'; + } elsif ($git) { + $vname = "Commit " . substr($filename, 0, 12) . 
`git log -1 --pretty=format:' ("%s")' $filename`; } else { $vname = $filename; } -- cgit v0.10.2 From 0dea9f1eef86bedacad91b6f652ca1ab0d08854c Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:19 -0700 Subject: checkpatch: reduce number of `git log` calls with --git checkpatch currently calls git log multiple times to first get the sha1 values and again to get the subject for each individual sha1 commit. Always get the sha1 and subject at the same time instead. Store the subject in a sha1 hash to avoid the second git log exec. Link: http://lkml.kernel.org/r/274efab2332ad2308ab5de85a95d255f6e2de5f3.1462711962.git.joe@perches.com Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 8fc9edd..9283662 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -28,6 +28,7 @@ my $terse = 0; my $showfile = 0; my $file = 0; my $git = 0; +my %git_commits = (); my $check = 0; my $check_orig = 0; my $summary = 1; @@ -806,23 +807,25 @@ die "$P: No git repository found\n" if ($git && !-e ".git"); if ($git) { my @commits = (); - for my $commit_expr (@ARGV) { + foreach my $commit_expr (@ARGV) { my $git_range; if ($commit_expr =~ m/-/) { my @tmp = split(/-/, $commit_expr); - die "$P: incorrect git commits expression $commit_expr$!\n" + die "$P: incorrect git commit expression '$commit_expr' $!\n" if (@tmp != 2); $git_range = "-$tmp[1] $tmp[0]"; } elsif ($commit_expr =~ m/\.\./) { $git_range = "$commit_expr"; - } - if (defined $git_range) { - my $lines = `git log --no-merges --pretty=format:'%H' $git_range`; - foreach my $line (split(/\n/, $lines)) { - unshift(@commits, $line); - } } else { - unshift(@commits, $commit_expr); + $git_range = "-1 $commit_expr"; + } + my $lines = `git log --no-color --no-merges --pretty=format:'%H %s' $git_range`; + foreach my $line (split(/\n/, $lines)) { + $line =~ /(^\w+) (.*)/; + my $sha1 = $1; + my $subject = $2; + unshift(@commits, $sha1); + $git_commits{$sha1} = $subject; } } die "$P: no git commits after extraction!\n" if (@commits == 0); @@ -847,7 +850,7 @@ for my $filename (@ARGV) { if ($filename eq '-') { $vname = 'Your patch'; } elsif ($git) { - $vname = "Commit " . substr($filename, 0, 12) . `git log -1 --pretty=format:' ("%s")' $filename`; + $vname = "Commit " . substr($filename, 0, 12) . ' ("' . $git_commits{$filename} . '")'; } else { $vname = $filename; } -- cgit v0.10.2 From 28898fd1a85fa37a59c090271e8be37afe710276 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Fri, 20 May 2016 17:04:22 -0700 Subject: checkpatch: improve --git shortcut The --git shortcut can be confused by a tag with a dash like v4.4-rc1. Improve the test to verify the expression ends with a dash followed by a numeric value. Improve the git log result to verify the " " output as well. 
Link: http://lkml.kernel.org/r/c4a3f759291d967641860c3a54bb81177f34325f.1462711962.git.joe@perches.com Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 9283662..6750595 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -809,11 +809,8 @@ if ($git) { my @commits = (); foreach my $commit_expr (@ARGV) { my $git_range; - if ($commit_expr =~ m/-/) { - my @tmp = split(/-/, $commit_expr); - die "$P: incorrect git commit expression '$commit_expr' $!\n" - if (@tmp != 2); - $git_range = "-$tmp[1] $tmp[0]"; + if ($commit_expr =~ m/^(.*)-(\d+)$/) { + $git_range = "-$2 $1"; } elsif ($commit_expr =~ m/\.\./) { $git_range = "$commit_expr"; } else { @@ -821,7 +818,8 @@ if ($git) { } my $lines = `git log --no-color --no-merges --pretty=format:'%H %s' $git_range`; foreach my $line (split(/\n/, $lines)) { - $line =~ /(^\w+) (.*)/; + $line =~ /^([0-9a-fA-F]{40,40}) (.*)$/; + next if (!defined($1) || !defined($2)); my $sha1 = $1; my $subject = $2; unshift(@commits, $sha1); -- cgit v0.10.2 From 4108124f5c46efc951c4c6be947598a21d6b7fde Mon Sep 17 00:00:00 2001 From: Heloise Date: Fri, 20 May 2016 17:04:25 -0700 Subject: fs/efs/super.c: fix return value When sb_bread() fails, the return value should be -EIO, fix it. Link: http://lkml.kernel.org/r/1463464943-4142-1-git-send-email-os@iscas.ac.cn Signed-off-by: Heloise Cc: Johannes Weiner Cc: Firo Yang Cc: Vladimir Davydov Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/fs/efs/super.c b/fs/efs/super.c index cb68dac..368f7dd 100644 --- a/fs/efs/super.c +++ b/fs/efs/super.c @@ -275,7 +275,7 @@ static int efs_fill_super(struct super_block *s, void *d, int silent) if (!bh) { pr_err("cannot read volume header\n"); - return -EINVAL; + return -EIO; } /* @@ -293,7 +293,7 @@ static int efs_fill_super(struct super_block *s, void *d, int silent) bh = sb_bread(s, sb->fs_start + EFS_SUPER); if (!bh) { pr_err("cannot read superblock\n"); - return -EINVAL; + return -EIO; } if (efs_validate_super(sb, (struct efs_super *) bh->b_data)) { -- cgit v0.10.2 From c8cdd2be213f0f5372287a764225665f1ac64e56 Mon Sep 17 00:00:00 2001 From: Rasmus Villemoes Date: Fri, 20 May 2016 17:04:27 -0700 Subject: init/main.c: simplify initcall_blacklisted() Using kasprintf to get the function name makes us look up the name twice, along with all the vsnprintf overhead of parsing the format string etc. It also means there is an allocation failure case to deal with. Since symbol_string in vsprintf.c would anyway allocate an array of size KSYM_SYMBOL_LEN on the stack, that might as well be done up here. Moreover, since this is a debug feature and the blacklisted_initcalls list is usually empty, we might as well test that and thus avoid looking up the symbol name even once in the common case. 
Signed-off-by: Rasmus Villemoes Acked-by: Rusty Russell Acked-by: Prarit Bhargava Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/init/main.c b/init/main.c index fa9b2bd..bc0f9e0 100644 --- a/init/main.c +++ b/init/main.c @@ -706,21 +706,20 @@ static int __init initcall_blacklist(char *str) static bool __init_or_module initcall_blacklisted(initcall_t fn) { struct blacklist_entry *entry; - char *fn_name; + char fn_name[KSYM_SYMBOL_LEN]; - fn_name = kasprintf(GFP_KERNEL, "%pf", fn); - if (!fn_name) + if (list_empty(&blacklisted_initcalls)) return false; + sprint_symbol_no_offset(fn_name, (unsigned long)fn); + list_for_each_entry(entry, &blacklisted_initcalls, next) { if (!strcmp(fn_name, entry->buf)) { pr_debug("initcall %s blacklisted\n", fn_name); - kfree(fn_name); return true; } } - kfree(fn_name); return false; } #else -- cgit v0.10.2 From 603ac5df86df5bfd26ed4d9c310dd196b07590ae Mon Sep 17 00:00:00 2001 From: Huang Shijie Date: Fri, 20 May 2016 17:04:30 -0700 Subject: kprobes: add the "tls" argument for j_do_fork Commit 3033f14ab78c ("clone: support passing tls argument via C rather than pt_regs magic") added the tls argument for _do_fork(). This patch adds the "tls" argument for j_do_fork to make it match _do_fork(). Signed-off-by: Huang Shijie Acked-by: Steve Capper Reviewed-by: Josh Triplett Cc: Masami Hiramatsu Cc: Andy Lutomirski Cc: Thiago Macieira Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/samples/kprobes/jprobe_example.c b/samples/kprobes/jprobe_example.c index c285a3b..c3108bb 100644 --- a/samples/kprobes/jprobe_example.c +++ b/samples/kprobes/jprobe_example.c @@ -25,7 +25,7 @@ /* Proxy routine having the same arguments as actual _do_fork() routine */ static long j_do_fork(unsigned long clone_flags, unsigned long stack_start, unsigned long stack_size, int __user *parent_tidptr, - int __user *child_tidptr) + int __user *child_tidptr, unsigned long tls) { pr_info("jprobe: clone_flags = 0x%lx, stack_start = 0x%lx " "stack_size = 0x%lx\n", clone_flags, stack_start, stack_size); -- cgit v0.10.2 From d04659ac94528e9224dbf1aed37dd11dd952cacc Mon Sep 17 00:00:00 2001 From: Huang Shijie Date: Fri, 20 May 2016 17:04:33 -0700 Subject: samples/kprobes: add a new module parameter Add a new module parameter which can be used as the symbol name. Without this patch, we can only test the "_do_fork" function with this kernel module. 
With this patch, the module becomes more flexible; we can test any function with this module with: # insmod kprobe_example.ko symbol="xxx" Link: http://lkml.kernel.org/r/1463535417-29637-1-git-send-email-shijie.huang@arm.com Signed-off-by: Huang Shijie Cc: Petr Mladek Cc: Steve Capper Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c index 727eb21..2bb190d 100644 --- a/samples/kprobes/kprobe_example.c +++ b/samples/kprobes/kprobe_example.c @@ -14,9 +14,13 @@ #include <linux/module.h> #include <linux/kprobes.h> +#define MAX_SYMBOL_LEN 64 +static char symbol[MAX_SYMBOL_LEN] = "_do_fork"; +module_param_string(symbol, symbol, sizeof(symbol), 0644); + /* For each probe you need to allocate a kprobe structure */ static struct kprobe kp = { - .symbol_name = "_do_fork", + .symbol_name = symbol, }; /* kprobe pre_handler: called just before the probed instruction is executed */ -- cgit v0.10.2 From ea9b50133ffebbd580cb5cd0aa222784d7a2fcb1 Mon Sep 17 00:00:00 2001 From: Huang Shijie Date: Fri, 20 May 2016 17:04:36 -0700 Subject: samples/kprobes: print out the symbol name for the hooks Print out the symbol name for the hooks; it makes the logs more readable. Link: http://lkml.kernel.org/r/1463535417-29637-2-git-send-email-shijie.huang@arm.com Signed-off-by: Huang Shijie Cc: Petr Mladek Cc: Steve Capper Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c index 2bb190d..ed0ca0c 100644 --- a/samples/kprobes/kprobe_example.c +++ b/samples/kprobes/kprobe_example.c @@ -27,24 +27,24 @@ static struct kprobe kp = { static int handler_pre(struct kprobe *p, struct pt_regs *regs) { #ifdef CONFIG_X86 - printk(KERN_INFO "pre_handler: p->addr = 0x%p, ip = %lx," + printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, ip = %lx," " flags = 0x%lx\n", - p->addr, regs->ip, regs->flags); + p->symbol_name, p->addr, regs->ip, regs->flags); #endif #ifdef CONFIG_PPC - printk(KERN_INFO "pre_handler: p->addr = 0x%p, nip = 0x%lx," + printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, nip = 0x%lx," " msr = 0x%lx\n", - p->addr, regs->nip, regs->msr); + p->symbol_name, p->addr, regs->nip, regs->msr); #endif #ifdef CONFIG_MIPS - printk(KERN_INFO "pre_handler: p->addr = 0x%p, epc = 0x%lx," + printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, epc = 0x%lx," " status = 0x%lx\n", - p->addr, regs->cp0_epc, regs->cp0_status); + p->symbol_name, p->addr, regs->cp0_epc, regs->cp0_status); #endif #ifdef CONFIG_TILEGX - printk(KERN_INFO "pre_handler: p->addr = 0x%p, pc = 0x%lx," + printk(KERN_INFO "<%s> pre_handler: p->addr = 0x%p, pc = 0x%lx," " ex1 = 0x%lx\n", - p->addr, regs->pc, regs->ex1); + p->symbol_name, p->addr, regs->pc, regs->ex1); #endif /* A dump_stack() here will give a stack backtrace */ @@ -56,20 +56,20 @@ static void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) { #ifdef CONFIG_X86 - printk(KERN_INFO "post_handler: p->addr = 0x%p, flags = 0x%lx\n", - p->addr, regs->flags); + printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, flags = 0x%lx\n", + p->symbol_name, p->addr, regs->flags); #endif #ifdef CONFIG_PPC - printk(KERN_INFO "post_handler: p->addr = 0x%p, msr = 0x%lx\n", - p->addr, regs->msr); + printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, msr = 0x%lx\n", + p->symbol_name, p->addr, regs->msr); #endif #ifdef CONFIG_MIPS - printk(KERN_INFO "post_handler: p->addr = 0x%p, status = 0x%lx\n", - p->addr, regs->cp0_status); + printk(KERN_INFO "<%s> 
post_handler: p->addr = 0x%p, status = 0x%lx\n", + p->symbol_name, p->addr, regs->cp0_status); #endif #ifdef CONFIG_TILEGX - printk(KERN_INFO "post_handler: p->addr = 0x%p, ex1 = 0x%lx\n", - p->addr, regs->ex1); + printk(KERN_INFO "<%s> post_handler: p->addr = 0x%p, ex1 = 0x%lx\n", + p->symbol_name, p->addr, regs->ex1); #endif } -- cgit v0.10.2
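A minimal sketch may help readers who want to try the two samples/kprobes changes together: the "symbol" module parameter selects which function to probe, and the handlers prefix every log line with that symbol. This is a condensed, hypothetical example based on the patched kprobe_example.c, not part of the patches themselves; the module name kprobe_minimal is invented, and the post_handler, the register_kprobe() error reporting, and the architecture-specific register dumps of the real sample are omitted for brevity.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

#define MAX_SYMBOL_LEN	64

/* Overridable at load time: insmod kprobe_minimal.ko symbol=<name> */
static char symbol[MAX_SYMBOL_LEN] = "_do_fork";
module_param_string(symbol, symbol, sizeof(symbol), 0644);

static struct kprobe kp = {
	.symbol_name	= symbol,
};

/* Tag the message with the probed symbol, as the last patch above does */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("<%s> pre_handler: p->addr = 0x%p\n", p->symbol_name, p->addr);
	return 0;
}

static int __init kprobe_minimal_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kprobe_minimal_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_minimal_init);
module_exit(kprobe_minimal_exit);
MODULE_LICENSE("GPL");

Loading such a module with, for example, symbol="do_exit" should move the probe from _do_fork() to do_exit(), and each pre_handler line in the log would then carry the <do_exit> prefix.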