summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2009-01-06mm: remove cgroup_mm_owner_callbacksHugh Dickins
cgroup_mm_owner_callbacks() was brought in to support the memrlimit controller, but sneaked into mainline ahead of it. That controller has now been shelved, and the mm_owner_changed() args were inadequate for it anyway (they needed an mm pointer instead of a task pointer). Remove the dead code, and restore mm_update_next_owner() locking to how it was before: taking mmap_sem there does nothing for memcontrol.c, now the only user of mm->owner. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06do_mpage_readpage(): remove useless clear_buffer_mapped() callFranck Bui-Huu
It is known that buffer_mapped() is false in this code path. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm/apply_to_range: call pte function with lazy updatesJeremy Fitzhardinge
Make the pte-level function in apply_to_range be called in lazy mmu mode, so that any pagetable modifications can be batched. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc make lazy unmapping configurableNick Piggin
Lazy unmapping in the vmalloc code has now opened the possibility for use after free bugs to go undetected. We can catch those by forcing an unmap and flush (which is going to be slow, but that's what happens). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc use mutex for purgeNick Piggin
The vmalloc purge lock can be a mutex so we can sleep while a purge is going on (purge involves a global kernel TLB invalidate, so it can take quite a while). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc improve vmallocinfoGlauber Costa
If we do that, output of files like /proc/vmallocinfo will show things like "vmalloc_32", "vmalloc_user", or whomever the caller was as the caller. This info is not as useful as the real caller of the allocation. So, proposal is to call __vmalloc_node node directly, with matching parameters to save the caller information Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc tweak failure printkGlauber Costa
If we can't service a vmalloc allocation, show size of the allocation that actually failed. Useful for debugging. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: more likely reclaim MADV_SEQUENTIAL mappingsJohannes Weiner
File pages mapped only in sequentially read mappings are perfect reclaim canditates. This patch makes these mappings behave like weak references, their pages will be reclaimed unless they have a strong reference from a normal mapping as well. It changes the reclaim and the unmap path where they check if the page has been referenced. In both cases, accesses through sequentially read mappings will be ignored. Benchmark results from KOSAKI Motohiro: http://marc.info/?l=linux-mm&m=122485301925098&w=2 Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: cleanup: remove #ifdef CONFIG_MIGRATIONKOSAKI Motohiro
#ifdef in *.c file decrease source readability a bit. removing is better. This patch doesn't have any functional change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: get rid of pagevec_release_nonlru()KOSAKI Motohiro
speculative page references patch (commit: e286781d5f2e9c846e012a39653a166e9d31777d) removed last pagevec_release_nonlru() caller. So this function can be removed now. This patch doesn't have any functional change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: print out memmap number only if it is not zeroYinghai Lu
Don't print the size of the zone's memmap array if it does not have one. Impact: cleanup Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: show node to memory section relationship with symlinks in sysfsGary Hade
Show node to memory section relationship with symlinks in sysfs Add /sys/devices/system/node/nodeX/memoryY symlinks for all the memory sections located on nodeX. For example: /sys/devices/system/node/node1/memory135 -> ../../memory/memory135 indicates that memory section 135 resides on node1. Also revises documentation to cover this change as well as updating Documentation/ABI/testing/sysfs-devices-memory to include descriptions of memory hotremove files 'phys_device', 'phys_index', and 'state' that were previously not described there. In addition to it always being a good policy to provide users with the maximum possible amount of physical location information for resources that can be hot-added and/or hot-removed, the following are some (but likely not all) of the user benefits provided by this change. Immediate: - Provides information needed to determine the specific node on which a defective DIMM is located. This will reduce system downtime when the node or defective DIMM is swapped out. - Prevents unintended onlining of a memory section that was previously offlined due to a defective DIMM. This could happen during node hot-add when the user or node hot-add assist script onlines _all_ offlined sections due to user or script inability to identify the specific memory sections located on the hot-added node. The consequences of reintroducing the defective memory could be ugly. - Provides information needed to vary the amount and distribution of memory on specific nodes for testing or debugging purposes. Future: - Will provide information needed to identify the memory sections that need to be offlined prior to physical removal of a specific node. Symlink creation during boot was tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems. Symlink creation during physical memory hot-add tested on a 2-node x86_64 system. Signed-off-by: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: do_sync_mapping_range integrity fixNick Piggin
Chris Mason notices do_sync_mapping_range didn't actually ask for data integrity writeout. Unfortunately, it is advertised as being usable for data integrity operations. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages more terminate quicklyAndrew Morton
Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages terminate quicklyNick Piggin
Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages optimise page cleaningNick Piggin
In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages cleanupsNick Piggin
Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages integrity fixNick Piggin
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity. The callers tend to set it to values that could break data interity semantics easily in practice. For example, nr_to_write can be set to mapping->nr_pages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might writeout two of these newly dirty pages, while not writing out the old page that should have been written out. Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data integrity bug. The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow." What we do today is return success after an arbitrary amount of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied. This change does indeed make the possibility of long stalls lager, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened. There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages writepage error fixNick Piggin
In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, wheras in reality some did not. Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages early loop terminationNick Piggin
We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: write_cache_pages cyclic fixNick Piggin
In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06do_mpage_readpage(): don't submit lots of small bios on boundaryMiquel van Smoorenburg
While tracing I/O patterns with blktrace (a great tool) a few weeks ago I identified a minor issue in fs/mpage.c As the comment above mpage_readpages() says, a fs's get_block function will set BH_Boundary when it maps a block just before a block for which extra I/O is required. Since get_block() can map a range of pages, for all these pages the BH_Boundary flag will be set. But we only need to push what I/O we have accumulated at the last block of this range. This makes do_mpage_readpage() send out the largest possible bio instead of a bunch of page-sized ones in the BH_Boundary case. Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06oom: print triggering task's cpuset and mems allowedDavid Rientjes
When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06oom: fix zone_scan_mutex nameDavid Rientjes
zone_scan_mutex is actually a spinlock, so name it appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: invoke oom-killer from page faultNick Piggin
Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: move_pages: no need to set pp->page to ZERO_PAGE(0) by defaultBrice Goglin
pp->page is never used when not set to the right page, so there is no need to set it to ZERO_PAGE(0) by default. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: rework do_pages_move() to work on page_sized chunksBrice Goglin
Rework do_pages_move() to work by page-sized chunks of struct page_to_node that are passed to do_move_page_to_node_array(). We now only have to allocate a single page instead a possibly very large vmalloc area to store all page_to_node entries. As a result, new_page_node() will now have a very small lookup, hidding much of the overall sys_move_pages() overhead. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Signed-off-by: Nathalie Furmento <Nathalie.Furmento@labri.fr> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: don't mark_page_accessed in shmem_faultHugh Dickins
Following "mm: don't mark_page_accessed in fault path", which now places a mark_page_accessed() in zap_pte_range(), we should remove the mark_page_accessed() from shmem_fault(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: don't mark_page_accessed in fault pathNick Piggin
Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at unmap-time if the pte is young has a number of problems. mark_page_accessed is supposed to be roughly the equivalent of a young pte for unmapped references. Unfortunately it doesn't come with any context: after being called, reclaim doesn't know who or why the page was touched. So calling mark_page_accessed not only adds extra lru or PG_referenced manipulations for pages that are already going to have pte_young ptes anyway, but it also adds these references which are difficult to work with from the context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not wish to contribute to the page being referenced). Then, simply doing SetPageReferenced when zapping a pte and finding it is young, is not a really good solution either. SetPageReferenced does not correctly promote the page to the active list for example. So after removing mark_page_accessed from the fault path, several mmap()+touch+munmap() would have a very different result from several read(2) calls for example, which is not really desirable. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: report the MMU pagesize in /proc/pid/smapsMel Gorman
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: report the pagesize backing a VMA in /proc/pid/smapsMel Gorman
It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm snapshot: extend exception store functions dm snapshot: split out exception store implementations dm snapshot: rename struct exception_store dm snapshot: separate out exception store interface dm mpath: move trigger_event to system workqueue dm: add name and uuid to sysfs dm table: rework reference counting dm: support barriers on simple devices dm request: extend target interface dm request: add caches dm ioctl: allow dm_copy_name_and_uuid to return only one field dm log: ensure log bitmap fits on log device dm log: move region_size validation dm log: avoid reinitialising io_req on every operation dm: consolidate target deregistration error handling dm raid1: fix error count dm log: fix dm_io_client leak on error paths dm snapshot: change yield to msleep dm table: drop reference at unbind
2009-01-06dm snapshot: extend exception store functionsJonathan Brassow
Supply dm_add_exception as a callback to the read_metadata function. Add a status function ready for a later patch and name the functions consistently. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm snapshot: split out exception store implementationsAlasdair G Kergon
Move the existing snapshot exception store implementations out into separate files. Later patches will place these behind a new interface in preparation for alternative implementations. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm snapshot: rename struct exception_storeJonathan Brassow
Rename struct exception_store to dm_exception_store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm snapshot: separate out exception store interfaceJonathan Brassow
Pull structures that bridge the gap between snapshot and exception store out of dm-snap.h and put them in a new .h file - dm-exception-store.h. This file will define the API for new exception stores. Ultimately, dm-snap.h is unnecessary, since only dm-snap.c should be using it. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm mpath: move trigger_event to system workqueueAlasdair G Kergon
The same workqueue is used both for sending uevents and processing queued I/O. Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting for the queued I/O to be processed. Use scheduled_work() for the asynchronous uevents instead. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm: add name and uuid to sysfsMilan Broz
Implement simple read-only sysfs entry for device-mapper block device. This patch adds a simple sysfs directory named "dm" under block device properties and implements - name attribute (string containing mapped device name) - uuid attribute (string containing UUID, or empty string if not set) The kobject is embedded in mapped_device struct, so no additional memory allocation is needed for initializing sysfs entry. During the processing of sysfs attribute we need to lock mapped device which is done by a new function dm_get_from_kobj, which returns the md associated with kobject and increases the usage count. Each 'show attribute' function is responsible for its own locking. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm table: rework reference countingMikulas Patocka
Rework table reference counting. The existing code uses a reference counter. When the last reference is dropped and the counter reaches zero, the table destructor is called. Table reference counters are acquired/released from upcalls from other kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all). If the reference counter reaches zero in one of the upcalls, the table destructor is called from almost random kernel code. This leads to various problems: * dm_any_congested being called under a spinlock, which calls the destructor, which calls some sleeping function. * the destructor attempting to take a lock that is already taken by the same process. * stale reference from some other kernel code keeps the table constructed, which keeps some devices open, even after successful return from "dmsetup remove". This can confuse lvm and prevent closing of underlying devices or reusing device minor numbers. The patch changes reference counting so that the table destructor can be called only at predetermined places. The table has always exactly one reference from either mapped_device->map or hash_cell->new_map. After this patch, this reference is not counted in table->holders. A pair of dm_create_table/dm_destroy_table functions is used for table creation/destruction. Temporary references from the other code increase table->holders. A pair of dm_table_get/dm_table_put functions is used to manipulate it. When the table is about to be destroyed, we wait for table->holders to reach 0. Then, we call the table destructor. We use active waiting with msleep(1), because the situation happens rarely (to one user in 5 years) and removing the device isn't performance-critical task: the user doesn't care if it takes one tick more or not. This way, the destructor is called only at specific points (dm_table_destroy function) and the above problems associated with lazy destruction can't happen. Finally remove the temporary protection added to dm_any_congested(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm: support barriers on simple devicesAndi Kleen
Implement barrier support for single device DM devices This patch implements barrier support in DM for the common case of dm linear just remapping a single underlying device. In this case we can safely pass the barrier through because there can be no reordering between devices. NB. Any DM device might cease to support barriers if it gets reconfigured so code must continue to allow for a possible -EOPNOTSUPP on every barrier bio submitted. - agk Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm request: extend target interfaceKiyoshi Ueda
This patch adds the following target interfaces for request-based dm. map_rq : for mapping a request rq_end_io : for finishing a request busy : for avoiding performance regression from bio-based dm. Target can tell dm core not to map requests now, and that may help requests in the block layer queue to be bigger by I/O merging. In bio-based dm, this behavior is done by device drivers managing the block layer queue. But in request-based dm, dm core has to do that since dm core manages the block layer queue. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm request: add cachesKiyoshi Ueda
This patch prepares some kmem_caches for request-based dm. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm ioctl: allow dm_copy_name_and_uuid to return only one fieldMilan Broz
Allow NULL buffer in dm_copy_name_and_uuid if you only want to return one of the fields. (Required by a following patch that adds these fields to sysfs.) Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm log: ensure log bitmap fits on log deviceMilan Broz
Check that the log bitmap will fit within the log device. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm log: move region_size validationMilan Broz
Move log size validation from mirror target to log constructor. Removed PAGE_SIZE restriction we no longer think necessary. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm log: avoid reinitialising io_req on every operationTakahiro Yasui
rw_header function updates three members of io_req data every time when I/O is processed. bi_rw and notify.fn are never modified once they get initialized, and so they can be set in advance. header_to_disk() can also be pulled out of write_header() since only one caller needs it and write_header() can be replaced by rw_header() directly. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm: consolidate target deregistration error handlingMikulas Patocka
Change dm_unregister_target to return void and use BUG() for error reporting. dm_unregister_target can only fail because of programming bug in the target driver. It can't fail because of user's behavior or disk errors. This patch changes unregister_target to return void and use BUG if someone tries to unregister non-registered target or unregister target that is in use. This patch removes code duplication (testing of error codes in all dm targets) and reports bugs in just one place, in dm_unregister_target. In some target drivers, these return codes were ignored, which could lead to a situation where bugs could be missed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm raid1: fix error countJonathan Brassow
Always increase the error count when I/O on a leg of a mirror fails. The error count is used to decide whether to select an alternative mirror leg. If the target doesn't use the "handle_errors" feature, the error count is not updated and the bio can get requeued forever by the read callback. Fix it by increasing error_count before the handle_errors feature checking. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm log: fix dm_io_client leak on error pathsTakahiro Yasui
In create_log_context function, dm_io_client_destroy function needs to be called, when memory allocation of disk_header, sync_bits and recovering_bits failed, but dm_io_client_destroy is not called. Cc: stable@kernel.org Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Acked-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm snapshot: change yield to msleepMikulas Patocka
Change yield() to msleep(1). If the thread had realtime priority, yield() doesn't really yield, so the yielding process would loop indefinitely and cause machine lockup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>