summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2011-01-13dm: use non reentrant workqueues if equivalentTejun Heo
kmirrord_wq, kcopyd_work and md->wq are created per dm instance and serve only a single work item from the dm instance, so non-reentrant workqueues would provide the same ordering guarantees as ordered ones while allowing CPU affinity and use of the workqueues for other purposes. Switch them to non-reentrant workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm: convert workqueues to alloc_orderedTejun Heo
Convert all create[_singlethread]_work() users to the new alloc[_ordered]_workqueue(). This conversion is mechanical and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm stripe: switch from local workqueue to system_wqTejun Heo
kstriped only serves sc->kstriped_ws which runs dm_table_event(). This doesn't need to be executed from an ordered workqueue w/ rescuer. Drop kstriped and use the system_wq instead. While at it, rename kstriped_ws to trigger_event so that it's consistent with other dm modules. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm: dont use flush_scheduled_workTejun Heo
flush_scheduled_work() is being deprecated. Flush the used work directly instead. In all dm targets, the only work which uses system_wq is ->trigger_event. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm snapshot: remove unused dm_snapshot queued_bios_workTejun Heo
dm_snapshot->queued_bios_work isn't used. Remove ->queued_bios[_work] from dm_snapshot structure, the flush_queued_bios work function and ksnapd workqueue. The DM snapshot changes that were going to use the ksnapd workqueue were either superseded (fix for origin write races) or never completed (deallocation of invalid snapshot's memory via workqueue). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm ioctl: suppress needless warning messagesMilan Broz
The device-mapper should not send warning messages to syslog if a device is not found. This can be done by userspace according to the returned dm-ioctl error code. So move these messages to debug level and use rate limiting to not flood syslog. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm crypt: add loop aes iv generatorMilan Broz
This patch adds a compatible implementation of the block chaining mode used by the Loop-AES block device encryption system (http://loop-aes.sourceforge.net/) designed by Jari Ruusu. It operates on full 512 byte sectors and uses CBC with an IV derived from the sector number, the data and optionally extra IV seed. This means that after CBC decryption the first block of sector must be tweaked according to decrypted data. Loop-AES can use three encryption schemes: version 1: is plain aes-cbc mode (already compatible) version 2: uses 64 multikey scheme with own IV generator version 3: the same as version 2 with additional IV seed (it uses 65 keys, last key is used as IV seed) The IV generator is here named lmk (Loop-AES multikey) and for the cipher specification looks like: aes:64-cbc-lmk Version 2 and 3 is recognised according to length of provided multi-key string (which is just hexa encoded "raw key" used in original Loop-AES ioctl). Configuration of the device and decoding key string will be done in userspace (cryptsetup). (Loop-AES stores keys in gpg encrypted file, raw keys are output of simple hashing of lines in this file). Based on an implementation by Max Vozeler: http://article.gmane.org/gmane.linux.kernel.cryptoapi/3752/ Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> CC: Max Vozeler <max@hinterhof.net>
2011-01-13dm crypt: add multi key capabilityMilan Broz
This patch adds generic multikey handling to be used in following patch for Loop-AES mode compatibility. This patch extends mapping table to optional keycount and implements generic multi-key capability. With more keys defined the <key> string is divided into several <keycount> sections and these are used for tfms. The tfm is used according to sector offset (sector 0->tfm[0], sector 1->tfm[1], sector N->tfm[N modulo keycount]) (only power of two values supported for keycount here). Because of tfms per-cpu allocation, this mode can be take a lot of memory on large smp systems. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Max Vozeler <max@hinterhof.net>
2011-01-13dm crypt: add post iv call to iv generatorMilan Broz
IV (initialisation vector) can in principle depend not only on sector but also on plaintext data (or other attributes). Change IV generator interface to work directly with dmreq structure to allow such dependence in generator. Also add post() function which is called after the crypto operation. This allows tricky modification of decrypted data or IV internals. In asynchronous mode the post() can be called after ctx->sector count was increased so it is needed to add iv_sector copy directly to dmreq structure. (N.B. dmreq always include only one sector in scatterlists) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm crypt: use io thread for reads only if mempool exhaustedMilan Broz
If there is enough memory, code can directly submit bio instead queing this operation in separate thread. Try to alloc bio clone with GFP_NOWAIT and only if it fails use separate queue (map function cannot block here). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm crypt: scale to multiple cpusAndi Kleen
Currently dm-crypt does all the encryption work for a single dm-crypt mapping in a single workqueue. This does not scale well when multiple CPUs are submitting IO at a high rate. The single CPU running the single thread cannot keep up with the encryption and encrypted IO performance tanks. This patch changes the crypto workqueue to be per CPU. This means that as long as the IO submitter (or the interrupt target CPUs for reads) runs on different CPUs the encryption work will be also parallel. To avoid a bottleneck on the IO worker I also changed those to be per-CPU threads. There is still some shared data, so I suspect some bouncing cache lines. But I haven't done a detailed study on that yet. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm crypt: simplify compatible table outputMilan Broz
Rename cc->cipher_mode to cc->cipher_string and store the whole of the cipher information so it can easily be printed when processing the DM_DEV_STATUS ioctl. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm log userspace: add version number to commsJonathan Brassow
This patch adds a 'version' field to the 'dm_ulog_request' structure. The 'version' field is taken from a portion of the unused 'padding' field in the 'dm_ulog_request' structure. This was done to avoid changing the size of the structure and possibly disrupting backwards compatibility. The version number will help notify user-space daemons when a change has been made to the kernel/userspace log API. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm log userspace: group clear and mark requestsJonathan Brassow
Allow the device-mapper log's 'mark' and 'clear' requests to be grouped and processed in a batch. This can significantly reduce the amount of traffic going between the kernel and userspace (where the processing daemon resides). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm log userspace: split flush queueJonathan Brassow
Split the 'flush_list', which contained a mix of both 'mark' and 'clear' requests, into two distinct lists ('mark_list' and 'clear_list'). The device mapper log implementations (used by various DM targets) are allowed to cache 'mark' and 'clear' requests until a 'flush' is received. Until now, these cached requests were kept in the same list. They will now be put into distinct lists to facilitate group processing of these requests (in the next patch). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm kcopyd: delay unpluggingMikulas Patocka
Make kcopyd merge more I/O requests by using device unplugging. Without this patch, each I/O request is dispatched separately to the device. If the device supports tagged queuing, there are many small requests sent to the device. To improve performance, this patch will batch as many requests as possible, allowing the queue to merge consecutive requests, and send them to the device at once. In my tests (15k SCSI disk), this patch improves sequential write throughput: Sequential write throughput (chunksize of 4k, 32k, 512k) unpatched: 15.2, 18.5, 17.5 MB/s patched: 14.4, 22.6, 23.0 MB/s In most common uses (snapshot or two-way mirror), kcopyd is only used for two devices, one for reading and the other for writing, thus this optimization is implemented only for two devices. The optimization may be extended to n-way mirrors with some code complexity increase. We keep track of two block devices to unplug (one for read and the other for write) and unplug them when exiting "do_work" thread. If there are more devices used (in theory it could happen, in practice it is rare), we unplug immediately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm log userspace: trap all failed log construction errorsJonathan Brassow
When constructing a mirror log, it is possible for the initial request to fail for other reasons besides -ESRCH. These must be handled too. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm crypt: set key size earlyMilan Broz
Simplify key size verification (hexadecimal string) and set key size early in constructor. (Patch required by later changes.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm: remove dm_mutex after bkl conversionMilan Broz
This patch replaces dm_mutex with _minor_lock in dm_blk_close() and then removes it. During the BKL conversion, commit 6e9624b8caec290d28b4c6d9ec75749df6372b87 (block: push down BKL into .open and .release) pushed lock_kernel() down into dm_blk_open/close calls. Commit 2a48fc0ab24241755dc93bfd4f01d68efab47f5a (block: autoconvert trivial BKL users to private mutex) converted it to a local mutex, but _minor_lock is sufficient. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm raid1: support discardMike Snitzer
Enable discard support in the DM mirror target. Also change an existing use of 'bvec' to 'addr' in the union. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm ioctl: allow rename to fill empty uuidPeter Jones
Allow the uuid of a mapped device to be set after device creation. Previously the uuid (which is optional) could only be set by DM_DEV_CREATE. If no uuid was supplied it could not be set later. Sometimes it's necessary to create the device before the uuid is known, and in such cases the uuid must be filled in after the creation. This patch extends DM_DEV_RENAME to accept a uuid accompanied by a new flag DM_UUID_FLAG. This can only be done once and if no uuid was previously supplied. It cannot be used to change an existing uuid. DM_VERSION_MINOR is also bumped to 19 to indicate this interface extension is available. Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm io: remove BIO_RW_SYNCIO flag from kcopydMikulas Patocka
Remove the REQ_SYNC flag to improve write throughput when writing to the origin with a snapshot on the same device (using the CFQ I/O scheduler). Sequential write throughput (chunksize of 4k, 32k, 512k) unpatched: 8.5, 8.6, 9.3 MB/s patched: 15.2, 18.5, 17.5 MB/s Snapshot exception reallocations are triggered by writes that are usually async, so mark the associated dm_io_request as async as well. This helps when using the CFQ I/O scheduler because it has separate queues for sync and async I/O. Async is optimized for throughput; sync for latency. With this change we're consciously favoring throughput over latency. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm mpath: disable blk_abort_queueMike Snitzer
Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2 dm: Call blk_abort_queue on failed paths Multipath began to use blk_abort_queue() to allow for lower latency path deactivation. This was found to cause list corruption: the cmd gets blk_abort_queued/timedout run on it and the scsi eh somehow is able to complete and run scsi_queue_insert while scsi_request_fn is still trying to process the request. https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: stable@kernel.org
2011-01-13dm: dont take i_mutex to change device sizeMike Snitzer
No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the size of a DM device. This additional locking is unnecessary because i_size_write() is already protected by the existing critical section in dm_swap_table(). DM already has a reference on md->bdev so the associated bd_inode may be changed without lifetime concerns. A negative side-effect of having held md->bdev->bd_inode->i_mutex was that a concurrent DM device resize and flush (via fsync) would deadlock. Dropping md->bdev->bd_inode->i_mutex eliminates this potential for deadlock. The following reproducer no longer deadlocks: https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
2011-01-13Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...
2011-01-11md: fix regression with re-adding devices to arrays with no metadataNeilBrown
Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were re-added when they shouldn't be but caused a regression in a less common case that means sometimes devices cannot be re-added when they should be. In particular, when re-adding a device to an array without metadata we should always access the device, but after the above commit we didn't. This patch sets the In_sync flag in that case so that the re-add succeeds. This patch is suitable for any -stable kernel to which 1a855a0606 was applied. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-07block: trace event block fix unassigned fieldJeff Moyer
The "error" field in block_bio_complete is not assigned, leaving the memory area uninitialized (keeping garbage data). Pass an additional tracepoint argument to this event to initialize this field. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Li Zefan <lizf@cn.fujitsu.com> CC: Alan.Brunelle@hp.com Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-20Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix cciss_revalidate panic block: max hardware sectors limit wrapper block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead blk-throttle: Correct the placement of smp_rmb() blk-throttle: Trim/adjust slice_end once a bio has been dispatched block: check for proper length of iov entries earlier in blk_rq_map_user_iov() drbd: fix for spin_lock_irqsave in endio callback drbd: don't recvmsg with zero length
2010-12-17block: max hardware sectors limit wrapperMike Snitzer
Implement blk_limits_max_hw_sectors() and make blk_queue_max_hw_sectors() a wrapper around it. DM needs this to avoid setting queue_limits' max_hw_sectors and max_sectors directly. dm_set_device_limits() now leverages blk_limits_max_hw_sectors() logic to establish the appropriate max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was incorrectly setting max_sectors rather than max_hw_sectors (which caused dm_merge_bvec()'s max_hw_sectors check to be ineffective). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits insteadMartin K. Petersen
When stacking devices, a request_queue is not always available. This forced us to have a no_cluster flag in the queue_limits that could be used as a carrier until the request_queue had been set up for a metadevice. There were several problems with that approach. First of all it was up to the stacking device to remember to set queue flag after stacking had completed. Also, the queue flag and the queue limits had to be kept in sync at all times. We got that wrong, which could lead to us issuing commands that went beyond the max scatterlist limit set by the driver. The proper fix is to avoid having two flags for tracking the same thing. We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the block layer merging functions. The queue_limit 'no_cluster' is turned into 'cluster' to avoid double negatives and to ease stacking. Clustering defaults to being enabled as before. The queue flag logic is removed from the stacking function, and explicitly setting the cluster flag is no longer necessary in DM and MD. Reported-by: Ed Lin <ed.lin@promise.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09md: protect against NULL reference when waiting to start a raid10.NeilBrown
When we fail to start a raid10 for some reason, we call md_unregister_thread to kill the thread that was created. Unfortunately md_thread() will then make one call into the handler (raid10d) even though md_wakeup_thread has not been called. This is not safe and as md_unregister_thread is called after mddev->private has been set to NULL, it will definitely cause a NULL dereference. So fix this at both ends: - md_thread should only call the handler if THREAD_WAKEUP has been set. - raid10 should call md_unregister_thread before setting things to NULL just like all the other raid modules do. This is applicable to 2.6.35 and later. Cc: stable@kernel.org Reported-by: "Citizen" <citizen_lee@thecus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: fix bug with re-adding of partially recovered device.NeilBrown
With v0.90 metadata, a hot-spare does not become a full member of the array until recovery is complete. So if we re-add such a device to the array, we know that all of it is as up-to-date as the event count would suggest, and so it a bitmap-based recovery is possible. However with v1.x metadata, the hot-spare immediately becomes a full member of the array, but it record how much of the device has been recovered. If the array is stopped and re-assembled recovery starts from this point. When such a device is hot-added to an array we currently lose the 'how much is recovered' information and incorrectly included it as a full in-sync member (after bitmap-based fixup). This is wrong and unsafe and could corrupt data. So be more careful about setting saved_raid_disk - which is what guides the re-adding of devices back into an array. The new code matches the code in slot_store which does a similar thing, which is encouraging. This is suitable for any -stable kernel. Reported-by: "Dailey, Nate" <Nate.Dailey@stratus.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: fix possible deadlock in handling flush requests.NeilBrown
As recorded in https://bugzilla.kernel.org/show_bug.cgi?id=24012 it is possible for a flush request through md to hang. This is due to an interaction between the recursion avoidance in generic_make_request, the insistence in md of only having one flush active at a time, and the possibility of dm (or md) submitting two flush requests to a device from the one generic_make_request. If a generic_make_request call into dm causes two flush requests to be queued (as happens if the dm table has two targets - they get one each), these two will be queued inside generic_make_request. Assume they are for the same md device. The first is processed and causes 1 or more flush requests to be sent to lower devices. These get queued within generic_make_request too. Then the second flush to the md device gets handled and it blocks waiting for the first flush to complete. But it won't complete until the two lower-device requests complete, and they haven't even been submitted yet as they are on the generic_make_request queue. The deadlock can be broken by using a separate thread to submit the requests to lower devices. md has such a thread readily available: md_wq. So use it to submit these requests. Reported-by: Giacomo Catenazzi <cate@cateee.net> Tested-by: Giacomo Catenazzi <cate@cateee.net> Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: move code in to submit_flushes.NeilBrown
submit_flushes is called from exactly one place. Move the code that is before and after that call into submit_flushes. This has not functional change, but will make the next patch smaller and easier to follow. Signed-off-by: NeilBrown <neilb@suse.de>
2010-12-09md: remove handling of flush_pending in md_submit_flush_dataNeilBrown
None of the functions called between setting flush_pending to 1, and atomic_dec_and_test can change flush_pending, or will anything running in any other thread (as ->flush_bio is not NULL). So the atomic_dec_and_test will always succeed. So remove the atomic_sec and the atomic_dec_and_test. Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-27Merge branch 'cleanup-bd_claim' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core
2010-11-24md: Call blk_queue_flush() to establish flush/fua supportDarrick J. Wong
Before 2.6.37, the md layer had a mechanism for catching I/Os with the barrier flag set, and translating the barrier into barriers for all the underlying devices. With 2.6.37, I/O barriers have become plain old flushes, and the md code was updated to reflect this. However, one piece was left out -- the md layer does not tell the block layer that it supports flushes or FUA access at all, which results in md silently dropping flush requests. Since the support already seems there, just add this one piece of bookkeeping. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24md/raid1: really fix recovery looping when single good device fails.NeilBrown
Commit 4044ba58dd15cb01797c4fd034f39ef4a75f7cc3 supposedly fixed a problem where if a raid1 with just one good device gets a read-error during recovery, the recovery would abort and immediately restart in an infinite loop. However it depended on raid1_remove_disk removing the spare device from the array. But that does not happen in this case. So add a test so that in the 'recovery_disabled' case, the device will be removed. This suitable for any kernel since 2.6.29 which is when recovery_disabled was introduced. Cc: stable@kernel.org Reported-by: Sebastian Färber <faerber@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24md: fix return value of rdev_size_change()Justin Maggard
When trying to grow an array by enlarging component devices, rdev_size_store() expects the return value of rdev_size_change() to be in sectors, but the actual value is returned in KBs. This functionality was broken by commit dd8ac336c13fd8afdb082ebacb1cddd5cf727889 so this patch is suitable for any kernel since 2.6.30. Cc: stable@kernel.org Signed-off-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-16block: Rename "block_remap" tracepoint to "block_bio_remap" to clarify the ↵Mike Snitzer
event. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-13block: clean up blkdev_get() wrappers and their usersTejun Heo
After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13block: make blkdev_get/put() handle exclusive accessTejun Heo
Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13block: simplify holder symlink handlingTejun Heo
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly complex with multiple holder considerations, redundant extra references to all involved kobjects, unused generic kobject holder support and unnecessary mixup with bd_claim/release functionalities. Strip it down to what's necessary (single gendisk holder) and make it use a separate interface. This is a step for cleaning up bd_claim/release. This patch makes dm-table slightly more complex but it will be simplified again with further changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com
2010-11-10block: read i_size with i_size_read()Mike Snitzer
Convert direct reads of an inode's i_size to using i_size_read(). i_size_{read,write} use a seqcount to protect reads from accessing incomple writes. Concurrent i_size_write()s require mutual exclussion to protect the seqcount that is used by i_size_{read,write}. But i_size_read() callers do not need to use additional locking. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-29md: tidy up device searches in read_balance.NeilBrown
The code for searching through the device list to read-balance in raid1 is rather clumsy and hard to follow. Try to simplify it a bit. No important functionality change here. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md/raid1: fix some typos in comments.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md/raid1: discard unused variable.NeilBrown
This structure field (flushing_bio_list) is never used, so remove it. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29md: unplug writes to external bitmaps.NeilBrown
When writing to an 'external' bitmap we don't currently unplug the device before waiting, so we can get a 3msec delay each time; So use REQ_UNPLUG to force and unplug. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: use separate bio pool for each md device.NeilBrown
bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28md: change type of first arg to sync_page_io.NeilBrown
Currently sync_page_io takes a 'bdev'. Every caller passes 'rdev->bdev'. We will soon want another field out of the rdev in sync_page_io, So just pass the rdev instead of the bdev out of it. Signed-off-by: NeilBrown <neilb@suse.de>