summaryrefslogtreecommitdiff
path: root/mm/sparse.c
AgeCommit message (Collapse)Author
2008-04-30revert "memory hotplug: allocate usemap on the section with pgdat"Andrew Morton
This: commit 86f6dae1377523689bd8468fed2f2dd180fc0560 Author: Yasunori Goto <y-goto@jp.fujitsu.com> Date: Mon Apr 28 02:13:33 2008 -0700 memory hotplug: allocate usemap on the section with pgdat Usemaps are allocated on the section which has pgdat by this. Because usemap size is very small, many other sections usemaps are allocated on only one page. If a section has usemap, it can't be removed until removing other sections. This dependency is not desirable for memory removing. Pgdat has similar feature. When a section has pgdat area, it must be the last section for removing on the node. So, if section A has pgdat and section B has usemap for section A, Both sections can't be removed due to dependency each other. To solve this issue, this patch collects usemap on same section with pgdat. If other sections doesn't have any dependency, this section will be able to be removed finally. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> broke davem's sparc64 bootup. Revert it while we work out what went wrong. Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30mm: remove remaining __FUNCTION__ occurrencesHarvey Harrison
__FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: free memmaps allocated by bootmemYasunori Goto
This patch is to free memmaps which is allocated by bootmem. Freeing usemap is not necessary. The pages of usemap may be necessary for other sections. If removing section is last section on the node, its section is the final user of usemap page. (usemaps are allocated on its section by previous patch.) But it shouldn't be freed too, because the section must be logical offline state which all pages are isolated against page allocater. If it is freed, page alloctor may use it which will be removed physically soon. It will be disaster. So, this patch keeps it as it is. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: allocate usemap on the section with pgdatYasunori Goto
Usemaps are allocated on the section which has pgdat by this. Because usemap size is very small, many other sections usemaps are allocated on only one page. If a section has usemap, it can't be removed until removing other sections. This dependency is not desirable for memory removing. Pgdat has similar feature. When a section has pgdat area, it must be the last section for removing on the node. So, if section A has pgdat and section B has usemap for section A, Both sections can't be removed due to dependency each other. To solve this issue, this patch collects usemap on same section with pgdat. If other sections doesn't have any dependency, this section will be able to be removed finally. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: align memmap to page sizeYasunori Goto
To free memmap easier, this patch aligns it to page size. Bootmem allocater may mix some objects in one pages. It's not good for freeing memmap of memory hot-remove. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28memory hotplug: register section/node id to freeYasunori Goto
This patch set is to free pages which is allocated by bootmem for memory-hotremove. Some structures of memory management are allocated by bootmem. ex) memmap, etc. To remove memory physically, some of them must be freed according to circumstance. This patch set makes basis to free those pages, and free memmaps. Basic my idea is using remain members of struct page to remember information of users of bootmem (section number or node id). When the section is removing, kernel can confirm it. By this information, some issues can be solved. 1) When the memmap of removing section is allocated on other section by bootmem, it should/can be free. 2) When the memmap of removing section is allocated on the same section, it shouldn't be freed. Because the section has to be logical memory offlined already and all pages must be isolated against page allocater. If it is freed, page allocator may use it which will be removed physically soon. 3) When removing section has other section's memmap, kernel will be able to show easily which section should be removed before it for user. (Not implemented yet) 4) When the above case 2), the page isolation will be able to check and skip memmap's page when logical memory offline (offline_pages()). Current page isolation code fails in this case because this page is just reserved page and it can't distinguish this pages can be removed or not. But, it will be able to do by this patch. (Not implemented yet.) 5) The node information like pgdat has similar issues. But, this will be able to be solved too by this. (Not implemented yet, but, remembering node id in the pages.) Fortunately, current bootmem allocator just keeps PageReserved flags, and doesn't use any other members of page struct. The users of bootmem doesn't use them too. This patch: This is to register information which is node or section's id. Kernel can distinguish which node/section uses the pages allcated by bootmem. This is basis for hot-remove sections or nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28hotplug memory remove: generic __remove_pages() supportBadari Pulavarty
Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. offline_pages() correctly adjusted zone and marked the pages reserved. TODO: Yasunori Goto is working on patches to free up allocations from bootmem. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-26x86_64/mm: check and print vmemmap allocation continuousYinghai Lu
On big systems with lots of memory, don't print out too much during bootup, and make it easy to find if it is continuous. on 256G 8 sockets system will get [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0 [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0 [ffffe20038700000-ffffe200387fffff] potential offnode page_structs [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1 [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2 [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2 [ffffe20054700000-ffffe200547fffff] potential offnode page_structs [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2 [ffffe20070700000-ffffe200707fffff] potential offnode page_structs [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3 [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4 [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4 [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4 [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5 [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6 [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6 [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6 [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7 instead of a very long print out... Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-26mm: make mem_map allocation continuousYinghai Lu
vmemmap allocation currently has this layout: [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0 ... note that there is a 2M hole between them - not optimal. the root cause is that usemap (24 bytes) will be allocated after every 2M mem_map, and it will push next vmemmap (2M) to the next (2M) alignment. solution: try to allocate the mem_map continously. after the patch, we get: [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0 ... which is the ideal layout. and usemap will share a page because of they are allocated continuously too: sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24 ... so we make the bootmem allocation more compact and use less memory for usemap => mission accomplished ;-) Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-16mm: sparsemem memory_present() fixIngo Molnar
Fix memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS. That causes this loop to happily walk over the limit of the sparse memory section map: for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { unsigned long section = pfn_to_section_nr(pfn); struct mem_section *ms; sparse_index_init(section, nid); set_section_nid(section, nid); ms = __nr_to_section(section); if (!ms->section_mem_map) ms->section_mem_map = sparse_encode_early_nid(nid) | SECTION_MARKED_PRESENT; 'ms' will be out of bounds and we'll corrupt a small amount of memory by encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it. The corruption might happen when encoding a non-zero node ID, or due to the SECTION_MARKED_PRESENT which is 0x1: mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0) The fix is to sanity check anything the architecture passes to sparsemem. This bug seems to be rather old (as old as sparsemem support itself), but the exact incarnation depended on random details like configs, which made this bug more prominent in v2.6.25-to-be. An additional enhancement might be to print a warning about ignored or trimmed memory ranges. Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <Yinghai.Lu@sun.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: fix section mismatch warning in sparse.cSam Ravnborg
Fix following warning: WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node() static sparse_early_usemap_alloc() were used only by sparse_init() and with sparse_init() annotated _init it is safe to annotate sparse_early_usemap_alloc with __init too. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05is_vmalloc_addr(): Check if an address is within the vmalloc boundariesChristoph Lameter
Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-18mm/sparse.c: improve the error handling for sparse_add_one_section()WANG Cong
Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST] Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-18mm/sparse.c: check the return value of sparse_index_alloc()WANG Cong
Since sparse_index_alloc() can return NULL on memory allocation failure, we must deal with the failure condition when calling it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29Revert "x86_64: allocate sparsemem memmap above 4G"Linus Torvalds
This reverts commit 2e1c49db4c640b35df13889b86b9d62215ade4b6. First off, testing in Fedora has shown it to cause boot failures, bisected down by Martin Ebourne, and reported by Dave Jobes. So the commit will likely be reverted in the 2.6.23 stable kernels. Secondly, in the 2.6.24 model, x86-64 has now grown support for SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the bug is not visible any more, it's become invisible due to the code just being irrelevant and no longer enabled on the only architecture that this ever affected. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Martin Ebourne <fedora@ebourne.me.uk> Cc: Zou Nan hai <nanhai.zou@intel.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16memory hotplug: Hot-add with sparsemem-vmemmapYasunori Goto
This patch is to avoid panic when memory hot-add is executed with sparsemem-vmemmap. Current vmemmap-sparsemem code doesn't support memory hot-add. Vmemmap must be populated when hot-add. This is for 2.6.23-rc2-mm2. Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential offnode page_structs" is displayed. To allocate memmap on its node, memmap (and pgdat) must be initialized itself like chicken and egg relationship. # vmemmap_unpopulate will be necessary for followings. - For cancel hot-add due to error. - For unplug. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Fix corruption of memmap on IA64 SPARSEMEM when mem_section is not a power of 2Mel Gorman
There are problems in the use of SPARSEMEM and pageblock flags that causes problems on ia64. The first part of the problem is that units are incorrect in SECTION_BLOCKFLAGS_BITS computation. This results in a map_section's section_mem_map being treated as part of a bitmap which isn't good. This was evident with an invalid virtual address when mem_init attempted to free bootmem pages while relinquishing control from the bootmem allocator. The second part of the problem occurs because the pageblock flags bitmap is be located with the mem_section. The SECTIONS_PER_ROOT computation using sizeof (mem_section) may not be a power of 2 depending on the size of the bitmap. This renders masks and other such things not power of 2 base. This issue was seen with SPARSEMEM_EXTREME on ia64. This patch moves the bitmap outside of mem_section and uses a pointer instead in the mem_section. The bitmaps are allocated when the section is being initialised. Note that sparse_early_usemap_alloc() does not use alloc_remap() like sparse_early_mem_map_alloc(). The allocation required for the bitmap on x86, the only architecture that uses alloc_remap is typically smaller than a cache line. alloc_remap() pads out allocations to the cache size which would be a needless waste. Credit to Bob Picco for identifying the original problem and effecting a fix for the SECTION_BLOCKFLAGS_BITS calculation. Credit to Andy Whitcroft for devising the best way of allocating the bitmaps only when required for the section. [wli@holomorphy.com: warning fix] Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: William Irwin <bill.irwin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Generic Virtual Memmap support for SPARSEMEMChristoph Lameter
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. [apw@shadowen.org: config cleanups, resplit code etc] [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init] [apw@shadowen.org: vmemmap: remove excess debugging] [apw@shadowen.org: simplify initialisation code and reduce duplication] [apw@shadowen.org: pull out the vmemmap code into its own file] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16sparsemem: record when a section has a valid mem_mapAndy Whitcroft
We have flags to indicate whether a section actually has a valid mem_map associated with it. This is never set and we rely solely on the present bit to indicate a section is valid. By definition a section is not valid if it has no mem_map and there is a window during init where the present bit is set but there is no mem_map, during which pfn_valid() will return true incorrectly. Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add a new present_section{,_nr} and pfn_present() interfaces for those users who care to know that a section is going to be valid. [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16sparsemem: clean up spelling error in commentsAndy Whitcroft
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. The current aim is to bring a common virtually mapped mem_map to all architectures. This should facilitate the removal of the bespoke implementations from the architectures. This also brings performance improvements for most architecture making sparsmem vmemmap the more desirable memory model. The ultimate aim of this work is to expand sparsemem support to encompass all the features of the other memory models. This could allow us to drop support for and remove the other models in the longer term. Below are some comparitive kernbench numbers for various architectures, comparing default memory model against SPARSEMEM VMEMMAP. All but ia64 show marginal improvement; we expect the ia64 figures to be sorted out when the larger mapping support returns. x86-64 non-NUMA Base VMEMAP % change (-ve good) User 85.07 84.84 -0.26 System 34.32 33.84 -1.39 Total 119.38 118.68 -0.59 ia64 Base VMEMAP % change (-ve good) User 1016.41 1016.93 0.05 System 50.83 51.02 0.36 Total 1067.25 1067.95 0.07 x86-64 NUMA Base VMEMAP % change (-ve good) User 30.77 431.73 0.22 System 45.39 43.98 -3.11 Total 476.17 475.71 -0.10 ppc64 Base VMEMAP % change (-ve good) User 488.77 488.35 -0.09 System 56.92 56.37 -0.97 Total 545.69 544.72 -0.18 Below are some AIM bencharks on IA64 and x86-64 (thank Bob). The seems pretty much flat as you would expect. ia64 results 2 cpu non-numa 4Gb SCSI disk Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 1 07:17:24 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 98.9 100 58.9 1.3 1.6482 101 5547.1 95 106.0 79.4 0.9154 201 6377.7 95 183.4 158.3 0.5288 301 6932.2 95 252.7 237.3 0.3838 401 7075.8 93 329.8 316.7 0.2941 501 7235.6 94 403.0 396.2 0.2407 600 7387.5 94 472.7 475.0 0.2052 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 1 09:59:04 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 99.1 100 58.8 1.2 1.6509 101 5480.9 95 107.2 79.2 0.9044 201 6490.3 95 180.2 157.8 0.5382 301 6886.6 94 254.4 236.8 0.3813 401 7078.2 94 329.7 316.0 0.2942 501 7250.3 95 402.2 395.4 0.2412 600 7399.1 94 471.9 473.9 0.2055 open power 710 2 cpu, 4 Gb, SCSI and configured physically Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme May 29 15:42:53 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.7 100 226.3 4.3 0.4286 101 1096.0 97 536.4 199.8 0.1809 201 1236.4 96 946.1 389.1 0.1025 301 1280.5 96 1368.0 582.3 0.0709 401 1270.2 95 1837.4 771.0 0.0528 501 1251.4 96 2330.1 955.9 0.0416 601 1252.6 96 2792.4 1139.2 0.0347 701 1245.2 96 3276.5 1334.6 0.0296 918 1229.5 96 4345.4 1728.7 0.0223 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap May 30 07:28:26 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.6 100 226.9 4.3 0.4275 101 1049.3 97 560.2 198.1 0.1731 201 1199.1 97 975.6 390.7 0.0994 301 1261.7 96 1388.5 591.5 0.0699 401 1256.1 96 1858.1 771.9 0.0522 501 1220.1 96 2389.7 955.3 0.0406 601 1224.6 96 2856.3 1133.4 0.0340 701 1252.0 96 3258.7 1314.1 0.0298 915 1232.8 96 4319.7 1704.0 0.0225 amd64 2 2-core, 4Gb and SATA Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 2 03:59:48 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 446.4 2.1 0.2173 101 533.4 97 1102.0 110.2 0.0880 201 578.3 97 2022.8 220.8 0.0480 301 583.8 97 3000.6 332.3 0.0323 401 580.5 97 4020.1 442.2 0.0241 501 574.8 98 5072.8 558.8 0.0191 600 566.5 98 6163.8 671.0 0.0157 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 3 04:19:31 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 447.8 2.0 0.2166 101 536.5 97 1095.6 109.7 0.0885 201 567.7 97 2060.5 219.3 0.0471 301 582.1 96 3009.4 330.2 0.0322 401 578.2 96 4036.4 442.4 0.0240 501 585.1 98 4983.2 555.1 0.0195 600 565.5 98 6175.2 660.6 0.0157 This patch: Fix some spelling errors. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23sparsemem: ensure we initialise the node mapping for SPARSEMEM_STATICAndy Whitcroft
Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c: Initializing HighMem for node 0 (00038000:00100000) Initializing HighMem for node 1 (00100000:001ffe00) ------------[ cut here ]------------ kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456! [...] This occurs because the section to node id mapping is not being setup correctly during init under SPARSEMEM_STATIC, leading to an attempt to free pages from all nodes into the zones on node 0. When the zone_table[] was removed in the following commit, a new section to node mapping table was introduced: commit 89689ae7f95995723fbcd5c116c47933a3bb8b13 [PATCH] Get rid of zone_table[] That conversion inadvertantly only initialised the node mapping in SPARSEMEM_EXTREME. Ensure we initialise the node mapping in SPARSEMEM_STATIC. [akpm@linux-foundation.org: make the stubs static inline] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-22x86_64: fix section mismatch warning in init.cSam Ravnborg
Fix following warning: WARNING: vmlinux.o(.text+0x188ea): Section mismatch: reference to .init.text:__alloc_bootmem_core (between 'alloc_bootmem_high_node' and 'get_gate_vma') alloc_bootmem_high_node() is only used from __init scope so declare it __init. And in addition declare the weak variant __init too. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-09Move three functions that are only needed for CONFIG_MEMORY_HOTPLUGStephen Rothwell
into the appropriate #ifdef. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01x86_64: allocate sparsemem memmap above 4GZou Nan hai
On systems with huge amount of physical memory, VFS cache and memory memmap may eat all available system memory under 4G, then the system may fail to allocate swiotlb bounce buffer. There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose not cover sparsemem model. This patch add fix to sparsemem model by first try to allocate memmap above 4G. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-19mm: fix section mismatch warningsSam Ravnborg
modpost had two cases hardcoded for mm/ Shift over to __init_refok and kill the hardcoded function names in modpost. This has the drawback that the functions will always be kept no matter configuration. With previous code the function were placed in init section if configuration allowed it. Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2007-05-08Add white list into modpost.c for memory hotplug code and ia64's machvec sectionYasunori Goto
This patch is add white list into modpost.c for some functions and ia64's section to fix section mismatchs. sparse_index_alloc() and zone_wait_table_init() calls bootmem allocator at boot time, and kmalloc/vmalloc at hotplug time. If config memory hotplug is on, there are references of bootmem allocater(init text) from them (normal text). This is cause of section mismatch. Bootmem is called by many functions and it must be used only at boot time. I think __init of them should keep for section mismatch check. So, I would like to register sparse_index_alloc() and zone_wait_table_init() into white list. In addition, ia64's .machvec section is function table of some platform dependent code. It is mixture of .init.text and normal text. These reference of __init functions are valid too. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08Fix section mismatch of memory hotplug related code.Yasunori Goto
This is to fix many section mismatches of code related to memory hotplug. I checked compile with memory hotplug on/off on ia64 and x86-64 box. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07[MM]: sparse_init() should be __init.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-07[PATCH] numa node ids are int, page_to_nid and zone_to_nid should return intAndy Whitcroft
NUMA node ids are passed as either int or unsigned int almost exclusivly page_to_nid and zone_to_nid both return unsigned long. This is a throw back to when page_to_nid was a #define and was thus exposing the real type of the page flags field. In addition to fixing up the definitions of page_to_nid and zone_to_nid I audited the users of these functions identifying the following incorrect uses: 1) mm/page_alloc.c show_node() -- printk dumping the node id, 2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison against numa_node_id() which returns an int from cpu_to_node(), and 3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which uses bit_set which in generic code takes an int. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] Get rid of zone_table[]Christoph Lameter
The zone table is mostly not needed. If we have a node in the page flags then we can get to the zone via NODE_DATA() which is much more likely to be already in the cpu cache. In case of SMP and UP NODE_DATA() is a constant pointer which allows us to access an exact replica of zonetable in the node_zones field. In all of the above cases there will be no need at all for the zone table. The only remaining case is if in a NUMA system the node numbers do not fit into the page flags. In that case we make sparse generate a table that maps sections to nodes and use that table to to figure out the node number. This table is sized to fit in a single cache line for the known 32 bit NUMA platform which makes it very likely that the information can be obtained without a cache miss. For sparsemem the zone table seems to be have been fairly large based on the maximum possible number of sections and the number of zones per node. There is some memory saving by removing zone_table. The main benefit is to reduce the cache foootprint of the VM from the frequent lookups of zones. Plus it simplifies the page allocator. [akpm@osdl.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28[PATCH] memory hotplug: __GFP_NOWARN is better for __kmalloc_section_memmap()Yasunori Goto
Add __GFP_NOWARN flag to calling of __alloc_pages() in __kmalloc_section_memmap(). It can reduce noisy failure message. In ia64, section size is 1 GB, this means that order 8 pages are necessary for each section's memmap. It is often very hard requirement under heavy memory pressure as you know. So, __alloc_pages() gives up allocation and shows many noisy stack traces which means no page for each sections. (Current my environment shows 32 times of stack trace....) But, __kmalloc_section_memmap() calls vmalloc() after failure of it, and it can succeed allocation of memmap. So, its stack trace warning becomes just noisy. I suppose it shouldn't be shown. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-28[PATCH] spin/rwlock init cleanupsIngo Molnar
locking init cleanups: - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK() - convert rwlocks in a similar manner this patch was generated automatically. Motivation: - cleanliness - lockdep needs control of lock initialization, which the open-coded variants do not give - it's also useful for -rt and for lock debugging in general Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] sparsemem: record nid during memory presentAndy Whitcroft
Record the node id as we mark sections for instantiation. Use this nid during instantiation to direct allocations. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mike Kravetz <kravetz@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Martin Bligh <mbligh@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21[PATCH] SPARSEMEM incorrectly calculates section numberMike Kravetz
A bad calculation/loop in __section_nr() could result in incorrect section information being put into sysfs memory entries. This primarily impacts memory add operations as the sysfs information is used while onlining new memory. Fix suggested by Dave Hansen. Note that the bug may not be obvious from the patch. It actually occurs in the function's return statement: return (root_nr * SECTIONS_PER_ROOT) + (ms - root); In the existing code, root_nr has already been multiplied by SECTIONS_PER_ROOT. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15[PATCH] add slab_is_available() routine for boot codeMike Kravetz
slab_is_available() indicates slab based allocators are available for use. SPARSEMEM code needs to know this as it can be called at various times during the boot process. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-02[PATCH] sparsemem interaction with memory add bug fixesMike Kravetz
This patch fixes two bugs with the way sparsemem interacts with memory add. They are: - memory leak if memmap for section already exists - calling alloc_bootmem_node() after boot These bugs were discovered and a first cut at the fixes were provided by Arnd Bergmann <arnd@arndb.de> and Joel Schopp <jschopp@us.ibm.com>. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp ↵Ravikiran G Thirumalai
macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] memory hotplug: move section_mem_map alloc to sparse.cDave Hansen
This basically keeps up from having to extern __kmalloc_section_memmap(). The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that header gets hard to work with, because it needs some arch-specific macros. Just stick it in here for now, instead of creating another header. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Lion Vollnhals <webmaster@schiggl.de> Signed-off-by: Jiri Slaby <xslaby@fi.muni.cz> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] memory hotplug prep: __section_nr helperDave Hansen
A little helper that we use in the hotplug code. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] sparsemem extreme: hotplug preparationDave Hansen
This splits up sparse_index_alloc() into two pieces. This is needed because we'll allocate the memory for the second level in a different place from where we actually consume it to keep the allocation from happening underneath a lock Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] sparsemem extreme implementationBob Picco
With cleanups from Dave Hansen <haveblue@us.ibm.com> SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts isolates the implementation details of the physical layout of the sparsemem section array. SPARSEMEM_EXTREME requires bootmem to be functioning at the time of memory_present() calls. This is not always feasible, so architectures which do not need it may allocate everything statically by using SPARSEMEM_STATIC. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] SPARSEMEM EXTREMEBob Picco
A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME. Architecture platforms with a very sparse physical address space would likely want to select this option. For those architecture platforms that don't select the option, the code generated is equivalent to SPARSEMEM currently in -mm. I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature. ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM -mm implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts isolates the implementation details of the physical layout of the sparsemem section array. ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false. I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME. I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME and tested with aim. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem hotplug baseAndy Whitcroft
Make sparse's initalization be accessible at runtime. This allows sparse mappings to be created after boot in a hotplug situation. This patch is separated from the previous one just to give an indication how much of the sparse infrastructure is *just* for hotplug memory. The section_mem_map doesn't really store a pointer. It stores something that is convenient to do some math against to get a pointer. It isn't valid to just do *section_mem_map, so I don't think it should be stored as a pointer. There are a couple of things I'd like to store about a section. First of all, the fact that it is !NULL does not mean that it is present. There could be such a combination where section_mem_map *is* NULL, but the math gets you properly to a real mem_map. So, I don't think that check is safe. Since we're storing 32-bit-aligned structures, we have a few bits in the bottom of the pointer to play with. Use one bit to encode whether there's really a mem_map there, and the other one to tell whether there's a valid section there. We need to distinguish between the two because sometimes there's a gap between when a section is discovered to be present and when we can get the mem_map for it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem memory modelAndy Whitcroft
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of mem_map[] is needed by discontiguous memory machines (like in the old CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually become a complete replacement. A significant advantage over DISCONTIGMEM is that it's completely separated from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA and DISCONTIG are often confused. Another advantage is that sparse doesn't require each NUMA node's ranges to be contiguous. It can handle overlapping ranges between nodes with no problems, where DISCONTIGMEM currently throws away that memory. Sparsemem uses an array to provide different pfn_to_page() translations for each SECTION_SIZE area of physical memory. This is what allows the mem_map[] to be chopped up. In order to do quick pfn_to_page() operations, the section number of the page is encoded in page->flags. Part of the sparsemem infrastructure enables sharing of these bits more dynamically (at compile-time) between the page_zone() and sparsemem operations. However, on 32-bit architectures, the number of bits is quite limited, and may require growing the size of the page->flags type in certain conditions. Several things might force this to occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of memory), an increase in the physical address space, or an increase in the number of used page->flags. One thing to note is that, once sparsemem is present, the NUMA node information no longer needs to be stored in the page->flags. It might provide speed increases on certain platforms and will be stored there if there is room. But, if out of room, an alternate (theoretically slower) mechanism is used. This patch introduces CONFIG_FLATMEM. It is used in almost all cases where there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM often have to compile out the same areas of code. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>