summaryrefslogtreecommitdiff
path: root/mm/swapfile.c
AgeCommit message (Collapse)Author
2007-07-29Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATIONRafael J. Wysocki
Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid confusion (among other things, with CONFIG_SUSPEND introduced in the next patch). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16vmscan: fix comments related to shrink_list()Anderson Briglia
Fix the shrink_list name on some files under mm/ directory. Signed-off-by: Anderson Briglia <anderson.briglia@indt.org.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07mm: make read_cache_page synchronousNick Piggin
Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate calls. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-06[PATCH] swsusp: Do not fail if resume device is not setRafael J. Wysocki
In the kernels later than 2.6.19 there is a regression that makes swsusp fail if the resume device is not explicitly specified. It can be fixed by adding an additional parameter to mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device *) corresponding to the first available swap back to the caller. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08[PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_pathJosef "Jeff" Sipek
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] struct seq_operations and struct file_operations constificationHelge Deller
- move some file_operations structs into the .rodata section - move static strings from policy_types[] array into the .rodata section - fix generic seq_operations usages, so that those structs may be defined as "const" as well [akpm@osdl.org: couple of fixes] Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] swsusp: use block device offsets to identify swap locationsRafael J. Wysocki
Make swsusp use block device offsets instead of swap offsets to identify swap locations and make it use the same code paths for writing as well as for reading data. This allows us to use the same code for handling swap files and swap partitions and to simplify the code, eg. by dropping rw_swap_page_sync(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] swsusp: use partition device and offset to identify swap areasRafael J. Wysocki
The Linux kernel handles swap files almost in the same way as it handles swap partitions and there are only two differences between these two types of swap areas: (1) swap files need not be contiguous, (2) the header of a swap file is not in the first block of the partition that holds it. From the swsusp's point of view (1) is not a problem, because it is already taken care of by the swap-handling code, but (2) has to be taken into consideration. In principle the location of a swap file's header may be determined with the help of appropriate filesystem driver. Unfortunately, however, it requires the filesystem holding the swap file to be mounted, and if this filesystem is journaled, it cannot be mounted during a resume from disk. For this reason we need some other means by which swap areas can be identified. For example, to identify a swap area we can use the partition that holds the area and the offset from the beginning of this partition at which the swap header is located. The following patch allows swsusp to identify swap areas this way. It changes swap_type_of() so that it takes an additional argument representing an offset of the swap header within the partition represented by its first argument. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] reject corrupt swapfiles earlierEric Sandeen
The fsfuzzer found this; with a corrupt small swapfile that claims to have many pages: [root]# file swap.741.img swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages [root]# ls -l swap.741.img -rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img sys_swapon() will try to vmalloc all those pages, and -then- check to see if the file is actually that large: if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) { <snip> if (swapfilesize && maxpages > swapfilesize) { printk(KERN_WARNING "Swap area shorter than signature indicates\n"); It seems to me that it would make more sense to move this test up before the vmalloc, with the other checks, to avoid the OOM-killer in this situation... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] Always print out the header line in /proc/swapsSuleiman Souhlal
It would be possible for /proc/swaps to not always print out the header: swapon /dev/hdc2 swapon /dev/hde2 swapoff /dev/hdc2 At this point /proc/swaps would not have a header. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] valid_swaphandles() fixHugh Dickins
akpm draws my attention to the fact that sysctl(VM_PAGE_CLUSTER) might conceivably change page_cluster to 0 while valid_swaphandles() is in the middle of using it, leading to an embarrassingly long loop: take a local snapshot of page_cluster and work with that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27[PATCH] swsusp: Fix swap_type_ofRafael J. Wysocki
There is a bug in mm/swapfile.c#swap_type_of() that makes swsusp only be able to use the first active swap partition as the resume device. Fix it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Hugh Dickins <hugh@veritas.com> Acked-by: Pavel Machek <pavel@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-23[PATCH] read_mapping_page for address spacePekka Enberg
Add read_mapping_page() which is used for callers that pass mapping->a_ops->readpage as the filler for read_cache_page. This removes some duplication from filesystem code. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] swapoff: use atomic_inc_not_zero() on mm_usersHugh Dickins
Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to use that on mm_users: doesn't actually matter at present, but safer to be sure that once mm_users has gone to 0, nothing raises it for an instant. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Swapless page migration: rip out swap based logicChristoph Lameter
Rip the page migration logic out. Remove all code that has to do with swapping during page migration. This also guts the ability to migrate pages to swap. No one used that so lets let it go for good. Page migration should be a bit broken after this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Swapless page migration: add R/W migration entriesChristoph Lameter
Implement read/write migration ptes We take the upper two swapfiles for the two types of migration ptes and define a series of macros in swapops.h. The VM is modified to handle the migration entries. migration entries can only be encountered when the page they are pointing to is locked. This limits the number of places one has to fix. We also check in copy_pte_range and in mprotect_pte_range() for migration ptes. We check for migration ptes in do_swap_cache and call a function that will then wait on the page lock. This allows us to effectively stop all accesses to apge. Migration entries are created by try_to_unmap if called for migration and removed by local functions in migrate.c From: Hugh Dickins <hugh@veritas.com> Several times while testing swapless page migration (I've no NUMA, just hacking it up to migrate recklessly while running load), I've hit the BUG_ON(!PageLocked(p)) in migration_entry_to_page. This comes from an orphaned migration entry, unrelated to the current correctly locked migration, but hit by remove_anon_migration_ptes as it checks an address in each vma of the anon_vma list. Such an orphan may be left behind if an earlier migration raced with fork: copy_one_pte can duplicate a migration entry from parent to child, after remove_anon_migration_ptes has checked the child vma, but before it has removed it from the parent vma. (If the process were later to fault on this orphaned entry, it would hit the same BUG from migration_entry_wait.) This could be fixed by locking anon_vma in copy_one_pte, but we'd rather not. There's no such problem with file pages, because vma_prio_tree_add adds child vma after parent vma, and the page table locking at each end is enough to serialize. Follow that example with anon_vma: add new vmas to the tail instead of the head. (There's no corresponding problem when inserting migration entries, because a missed pte will leave the page count and mapcount high, which is allowed for. And there's no corresponding problem when migrating via swap, because a leftover swap entry will be correctly faulted. But the swapless method has no refcounting of its entries.) From: Ingo Molnar <mingo@elte.hu> pte_unmap_unlock() takes the pte pointer as an argument. From: Hugh Dickins <hugh@veritas.com> Several times while testing swapless page migration, gcc has tried to exec a pointer instead of a string: smells like COW mappings are not being properly write-protected on fork. The protection in copy_one_pte looks very convincing, until at last you realize that the second arg to make_migration_entry is a boolean "write", and SWP_MIGRATION_READ is 30. Anyway, it's better done like in change_pte_range, using is_write_migration_entry and make_migration_entry_read. From: Hugh Dickins <hugh@veritas.com> Remove unnecessary obfuscation from sys_swapon's range check on swap type, which blew up causing memory corruption once swapless migration made MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> From: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] migration: remove unnecessary PageSwapCache checksChristoph Lameter
Remove two unnecessary PageSwapCache checks. The page refcount is raised and therefore page migration cannot occur in both functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] mm: schedule find_trylock_page() removalNick Piggin
find_trylock_page() is an odd interface in that it doesn't take a reference like the others. Now that XFS no longer uses it, and its last remaining caller actually wants an elevated refcount, opencode that callsite and schedule find_trylock_page() for removal. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: userland interfaceRafael J. Wysocki
This patch introduces a user space interface for swsusp. The interface is based on a special character device, called the snapshot device, that allows user space processes to perform suspend and resume-related operations with the help of some ioctls and the read()/write() functions.  Additionally it allows these processes to allocate free swap pages from a selected swap partition, called the resume partition, so that they know which sectors of the resume partition are available to them. The interface uses the same low-level system memory snapshot-handling functions that are used by the built-it swap-writing/reading code of swsusp. The interface documentation is included in the patch. The patch assumes that the major and minor numbers of the snapshot device will be 10 (ie. misc device) and 231, the registration of which has already been requested. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: low level interfaceRafael J. Wysocki
Introduce the low level interface that can be used for handling the snapshot of the system memory by the in-kernel swap-writing/reading code of swsusp and the userland interface code (to be introduced shortly). Also change the way in which swsusp records the allocated swap pages and, consequently, simplifies the in-kernel swap-writing/reading code (this is necessary for the userland interface too). To this end, it introduces two helper functions in mm/swapfile.c, so that the swsusp code does not refer directly to the swap internals. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] fix swap cluster offsetAkinobu Mita
When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the first index of swap cluster. But current code probably sets it wrong offset. Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: remove_from_swap() to remove swap ptesChristoph Lameter
Add remove_from_swap remove_from_swap() allows the restoration of the pte entries that existed before page migration occurred for anonymous pages by walking the reverse maps. This reduces swap use and establishes regular pte's without the need for page faults. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] Direct Migration V9: PageSwapCache checksChristoph Lameter
Check for PageSwapCache after looking up and locking a swap page. The page migration code may change a swap pte to point to a different page under lock_page(). If that happens then the vm must retry the lookup operation in the swap space to find the correct page number. There are a couple of locations in the VM where a lock_page() is done on a swap page. In these locations we need to check afterwards if the page was migrated. If the page was migrated then the old page that was looked up before was freed and no longer has the PageSwapCache bit set. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-19[PATCH] sem2mutex: mm/slab.cIngo Molnar
Convert mm/swapfile.c's swapon_sem to swapon_mutex. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12[PATCH] move capable() to capability.hRandy.Dunlap
- Move capable() from sched.h to capability.h; - Use <linux/capability.h> where capable() is used (in include/, block/, ipc/, kernel/, a few drivers/, mm/, security/, & sound/; many more drivers/ to go) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11add missing printk loglevel in mm/swapfile.cJesper Juhl
in mm/swapfile.c a printk() is missing a loglevel. I believe the proper loglevel for this situation is KERN_ERR, so that's what the patch below sets -if you agree, please apply. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-09[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_semJes Sorensen
This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09[PATCH] mm: clean up local variablesTobias Klauser
Clean up a local variable with the same name as a variable in a larger block. Also move a variable into the block where it's actually used. Spotted by http://linuxicc.sourceforge.net/ Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: add a new function (needed for swap suspend)Rafael J. Wysocki
This adds the function get_swap_page_of_type() allowing us to specify an index in swap_info[] and select a swap_info_struct structure to be used for allocating a swap page. This function (or another one of similar functionality) will be necessary for implementing the image-writing part of swsusp in the user space.  It can also be used for simplifying the current in-kernel implementation of the image-writing part of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] mm/swapfile.c: unexport total_swap_pagesAdrian Bunk
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] mm: split page table lockHugh Dickins
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] mm: pte_offset_map_lock loopsHugh Dickins
Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] mm: rss = file_rss + anon_rssHugh Dickins
I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] mm: anon is already wrprotectedHugh Dickins
do_anonymous_page's pte_wrprotect causes some confusion: in such a case, vm_page_prot must already be forcing COW, so must omit write permission, and so the pte_wrprotect is redundant. Replace it by a comment to that effect, and reword the comment on unuse_pte which also caused confusion. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-23[PATCH] Fix bd_claim() error code.Rob Landley
Problem: In some circumstances, bd_claim() is returning the wrong error code. If we try to swapon an unused block device that isn't swap formatted, we get -EINVAL. But if that same block device is already mounted, we instead get -EBUSY, even though it still isn't a valid swap device. This issue came up on the busybox list trying to get the error message from "swapon -a" right. If a swap device is already enabled, we get -EBUSY, and we shouldn't report this as an error. But we can't distinguish the two -EBUSY conditions, which are very different errors. In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy means "somebody other than sys_swapon has already claimed this", and _that_ means this block device can't be a valid swap device. So return -EINVAL there. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] mm: fix-up schedule_timeout() usageNishanth Aravamudan
Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap_lock replace list+deviceHugh Dickins
The idea of a swap_device_lock per device, and a swap_list_lock over them all, is appealing; but in practice almost every holder of swap_device_lock must already hold swap_list_lock, which defeats the purpose of the split. The only exceptions have been swap_duplicate, valid_swaphandles and an untrodden path in try_to_unuse (plus a few places added in this series). valid_swaphandles doesn't show up high in profiles, but swap_duplicate does demand attention. However, with the hold time in get_swap_pages so much reduced, I've not yet found a load and set of swap device priorities to show even swap_duplicate benefitting from the split. Certainly the split is mere overhead in the common case of a single swap device. So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock (generally we seem to prefer an _ in the name, and not hide in a macro). If someone can show a regression in swap_duplicate, then probably we should add a hashlock for the swap_map entries alone (shorts being anatomic), so as to help the case of the single swap device too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map latency breaksHugh Dickins
The get_swap_page/scan_swap_map latency can be so bad that even those without preemption configured deserve relief: periodically cond_resched. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map drop swap_device_lockHugh Dickins
get_swap_page has often shown up on latency traces, doing lengthy scans while holding two spinlocks. swap_list_lock is already dropped, now scan_swap_map drop swap_device_lock before scanning the swap_map. While scanning for an empty cluster, don't worry that racing tasks may allocate what was free and free what was allocated; but when allocating an entry, check it's still free after retaking the lock. Avoid dropping the lock in the expected common path. No barriers beyond the locks, just let the cookie crumble; highest_bit limit is volatile, but benign. Guard against swapoff: must check SWP_WRITEOK before allocating, must raise SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to fall - just use schedule_timeout, we don't want to burden scan_swap_map itself, and it's very unlikely that anyone can really still be in scan_swap_map once swapoff gets this far. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map restyledHugh Dickins
Rewrite scan_swap_map to allocate in just the same way as before (taking the next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly empty cluster, falling back to lowest entry if none), but with a view towards dropping the lock in the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: get_swap_page drop swap_list_lockHugh Dickins
Rewrite get_swap_page to allocate in just the same sequence as before, but without holding swap_list_lock across its scan_swap_map. Decrement nr_swap_pages and update swap_list.next in advance, while still holding swap_list_lock. Skip full devices by testing highest_bit. Swapoff hold swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. Reduces lock contention when there are parallel swap devices of the same priority. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: freeing update swap_list.nextHugh Dickins
This makes negligible difference in practice: but swap_list.next should not be updated to a higher prio in the general helper swap_info_get, but rather in swap_entry_free; and then only in the case when entry is actually freed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap unsigned int consistencyHugh Dickins
The swap header's unsigned int last_page determines the range of swap pages, but swap_info has been using int or unsigned long in some cases: use unsigned int throughout (except, in several places a local unsigned long is useful to avoid overflows when adding). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: show span of swap extentsHugh Dickins
The "Adding %dk swap" message shows the number of swap extents, as a guide to how fragmented the swapfile may be. But a useful further guide is what total extent they span across (sometimes scarily large). And there's no need to keep nr_extents in swap_info: it's unused after the initial message, so save a little space by keeping it on stack. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap extent list is orderedHugh Dickins
There are several comments that swap's extent_list.prev points to the lowest extent: that's not so, it's extent_list.next which points to it, as you'd expect. And a couple of loops in add_swap_extent which go all the way through the list, when they should just add to the other end. Fix those up, and let map_swap_page search the list forwards: profiles shows it to be twice as quick that way - because prefetch works better on how the structs are typically kmalloc'ed? or because usually more is written to than read from swap, and swap is allocated ascendingly? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: move destroy_swap_extents callsHugh Dickins
sys_swapon's call to destroy_swap_extents on failure is made after the final swap_list_unlock, which is faintly unsafe: another sys_swapon might already be setting up that swap_info_struct. Calling it earlier, before taking swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe, but likewise move it earlier, before taking the locks (once try_to_unuse has completed, nothing can be needing the swap extents). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: correct swapfile nr_good_pagesHugh Dickins
If a regular swapfile lies on a filesystem whose blocksize is less than PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap pages; but sys_swapon's nr_good_pages was not expecting that. Also, setup_swap_extents takes no account of badpages listed in the swap header: not worth doing so, but ensure nr_badpages is 0 for a regular swapfile. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: update swapfile i_sem commentHugh Dickins
Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22[PATCH] can_share_swap_page: use page_mapcountHugh Dickins
Remember that ironic get_user_pages race? when the raised page_count on a page swapped out led do_wp_page to decide that it had to copy on write, so substituted a different page into userspace. 2.6.7 onwards have Andrea's solution, where try_to_unmap_one backs out if it finds page_count raised. Which works, but is unsatisfying (rmap.c has no other page_count heuristics), and was found a few months ago to hang an intensive page migration test. A year ago I was hesitant to engage page_mapcount, now it seems the right fix. So remove the page_count hack from try_to_unmap_one; and use activate_page in unuse_mm when dropping lock, to replace its secondary effect of helping swapoff to make progress in that case. Simplify can_share_swap_page (now called only on anonymous pages) to check page_mapcount + page_swapcount == 1: still needs the page lock to stabilize their (pessimistic) sum, but does not need swapper_space.tree_lock for that. In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to keep sum on the high side, and correct when can_share_swap_page called. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>