summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2009-09-09Merge commit 'md/for-linus' into async-tx-nextDan Williams
Conflicts: drivers/md/raid5.c
2009-09-09Merge branch 'dmaengine' into async-tx-nextDan Williams
Conflicts: crypto/async_tx/async_xor.c drivers/dma/ioat/dma_v2.h drivers/dma/ioat/pci.c drivers/md/raid5.c
2009-09-09dmaengine: add fence supportDan Williams
Some engines optimize operation by reading ahead in the descriptor chain such that descriptor2 may start execution before descriptor1 completes. If descriptor2 depends on the result from descriptor1 then a fence is required (on descriptor2) to disable this optimization. The async_tx api could implicitly identify dependencies via the 'depend_tx' parameter, but that would constrain cases where the dependency chain only specifies a completion order rather than a data dependency. So, provide an ASYNC_TX_FENCE to explicitly identify data dependencies. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-09Merge branch 'md-raid6-accel' into ioat3.2Dan Williams
Conflicts: include/linux/dmaengine.h
2009-09-04dm snapshot: fix on disk chunk size validationMikulas Patocka
Fix some problems seen in the chunk size processing when activating a pre-existing snapshot. For a new snapshot, the chunk size can either be supplied by the creator or a default value can be used. For an existing snapshot, the chunk size in the snapshot header on disk should always be used. If someone attempts to load an existing snapshot and has the 'default chunk size' option set, the kernel uses its default value even when it is incorrect for the snapshot being loaded. This patch ensures the correct on-disk value is always used. Secondly, when the code does use the chunk size stored on the disk it is prudent to revalidate it, so the code can exit cleanly if it got corrupted as happened in https://bugzilla.redhat.com/show_bug.cgi?id=461506 . Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm exception store: split set_chunk_sizeMikulas Patocka
Break the function set_chunk_size to two functions in preparation for the fix in the following patch. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm snapshot: fix header corruption race on invalidationMikulas Patocka
If a persistent snapshot fills up, a race can corrupt the on-disk header which causes a crash on any future attempt to activate the snapshot (typically while booting). This patch fixes the race. When the snapshot overflows, __invalidate_snapshot is called, which calls snapshot store method drop_snapshot. It goes to persistent_drop_snapshot that calls write_header. write_header constructs the new header in the "area" location. Concurrently, an existing kcopyd job may finish, call copy_callback and commit_exception method, that goes to persistent_commit_exception. persistent_commit_exception doesn't do locking, relying on the fact that callbacks are single-threaded, but it can race with snapshot invalidation and overwrite the header that is just being written while the snapshot is being invalidated. The result of this race is a corrupted header being written that can lead to a crash on further reactivation (if chunk_size is zero in the corrupted header). The fix is to use separate memory areas for each. See the bug: https://bugzilla.redhat.com/show_bug.cgi?id=461506 Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm snapshot: refactor zero_disk_area to use chunk_ioMikulas Patocka
Refactor chunk_io to prepare for the fix in the following patch. Pass an area pointer to chunk_io and simplify zero_disk_area to use chunk_io. No functional change. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm log: userspace add luid to distinguish between concurrent log instancesJonathan Brassow
Device-mapper userspace logs (like the clustered log) are identified by a universally unique identifier (UUID). This identifier is used to associate requests from the kernel to a specific log in userspace. The UUID must be unique everywhere, since multiple machines may use this identifier when communicating about a particular log, as is the case for cluster logs. Sometimes, device-mapper/LVM may re-use a UUID. This is the case during pvmoves, when moving from one segment of an LV to another, or when resizing a mirror, etc. In these cases, a new log is created with the same UUID and loaded in the "inactive" slot. When a device-mapper "resume" is issued, the "live" table is deactivated and the new "inactive" table becomes "live". (The "inactive" table can also be removed via a device-mapper 'clear' command.) The above two issues were colliding. More than one log was being created with the same UUID, and there was no way to distinguish between them. So, sometimes the wrong log would be swapped out during the exchange. The solution is to create a locally unique identifier, 'luid', to go along with the UUID. This new identifier is used to determine exactly which log is being referenced by the kernel when the log exchange is made. The identifier is not universally safe, but it does not need to be, since create/destroy/suspend/resume operations are bound to a specific machine; and these are the operations that make up the exchange. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm raid1: do not allow log_failure variable to unset after being setJonathan Brassow
This patch fixes a bug which was triggering a case where the primary leg could not be changed on failure even when the mirror was in-sync. The case involves the failure of the primary device along with the transient failure of the log device. The problem is that bios can be put on the 'failures' list (due to log failure) before 'fail_mirror' is called due to the primary device failure. Normally, this is fine, but if the log device failure is transient, a subsequent iteration of the work thread, 'do_mirror', will reset 'log_failure'. The 'do_failures' function then resets the 'in_sync' variable when processing bios on the failures list. The 'in_sync' variable is what is used to determine if the primary device can be switched in the event of a failure. Since this has been reset, the primary device is incorrectly assumed to be not switchable. The case has been seen in the cluster mirror context, where one machine realizes the log device is dead before the other machines. As the responsibilities of the server migrate from one node to another (because the mirror is being reconfigured due to the failure), the new server may think for a moment that the log device is fine - thus resetting the 'log_failure' variable. In any case, it is inappropiate for us to reset the 'log_failure' variable. The above bug simply illustrates that it can actually hurt us. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm log: remove incorrect field from userspace table outputJonathan Brassow
The output of 'dmsetup table' includes an internal field that should not be there. This patch removes it. To make the fix simpler, we first reorder a constructor argument The 'device size' argument is generated internally. Currently it is placed as the last space-separated word of the constructor string. However, we need to use a version of the string without this word, so we move it to the beginning instead so it is trivial to skip past it. We keep a copy of the arguments passed to userspace for creating a log, just in case we need to resend them. These are the same arguments that are desired in the STATUSTYPE_TABLE request, except for one. When creating the userspace log, the userspace daemon must know the size of the mirror, so that is added to the arguments given in the constructor table. We were printing this extra argument out as well, which is a mistake. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm log: fix userspace status outputJonathan Brassow
Fix 'dmsetup table' output. There is a missing ' ' at the end of the string causing two words to run together. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm stripe: expose correct io hintsMike Snitzer
Set sensible I/O hints for striped DM devices in the topology infrastructure added for 2.6.31 for userspace tools to obtain via sysfs. Add .io_hints to 'struct target_type' to allow the I/O hints portion (io_min and io_opt) of the 'struct queue_limits' to be set by each target and implement this for dm-stripe. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm table: add more context to terse warning messagesMike Snitzer
A couple of recent warning messages make it difficult for the reader to determine exactly what is wrong. This patch adds more information to those messages. The messages were added by these commits: 5dea271b6d87bd1d79a59c1d5baac2596a841c37 ("dm table: pass correct dev area size to device_area_is_valid") ea9df47cc92573b159ef3b4fda516c32cba9c4fd ("dm table: fix blk_stack_limits arg to use bytes not sectors") The patch also corrects references to logical_block_size in printk format strings from %hu to %u. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm table: fix queue_limit checking device iteratorMikulas Patocka
The logic to check for valid device areas is inverted relative to proper use with iterate_devices. The iterate_devices method calls its callback for every underlying device in the target. If any callback returns non-zero, iterate_devices exits immediately. But the callback device_area_is_valid() returns 0 on error and 1 on success. The overall effect without is that an error is issued only if every device is invalid. This patch renames device_area_is_valid to device_area_is_invalid and inverts the logic so that one invalid device is sufficient to raise an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm snapshot: implement iterate devicesMike Snitzer
Implement the .iterate_devices for the origin and snapshot targets. dm-snapshot's lack of .iterate_devices resulted in the inability to properly establish queue_limits for both targets. With 4K sector drives: an unfortunate side-effect of not establishing proper limits in either targets' DM device was that IO to the devices would fail even though both had been created without error. Commit af4874e03ed82f050d5872d8c39ce64bf16b5c38 ("dm target:s introduce iterate devices fn") in 2.6.31-rc1 should have implemented .iterate_devices for dm-snap.c's origin and snapshot targets. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04dm multipath: fix oops when request based io fails when no pathsKiyoshi Ueda
The patch posted at http://marc.info/?l=dm-devel&m=124539787228784&w=2 which was merged into cec47e3d4a861e1d942b3a580d0bbef2700d2bb2 ("dm: prepare for request based option") introduced a regression in request-based dm. If map_request() calls dm_kill_unmapped_request() to complete a cloned bio without dispatching it, clone->bio is still set when dm_end_request() is called and the BUG_ON(clone->bio) is incorrect. The patch fixes this bug by freeing bio in dm_end_request() if the clone has bio. I've redone my tests to cover all I/O paths and confirmed there's no other regression. Here is the oops I hit in request-based dm when I do I/O to a multipath device which doesn't have any active path nor queue_if_no_path setting: ------------[ cut here ]------------ kernel BUG at /root/2.6.31-rc4.rqdm/drivers/md/dm.c:828! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map CPU 1 Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac sg sr_mod e1000e button cdrom serio_raw rtc_cmos rtc_core rtc_lib piix lpfc scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Pid: 7, comm: ksoftirqd/1 Not tainted 2.6.31-rc4.rqdm #1 Express5800/120Lj [N8100-1417] RIP: 0010:[<ffffffffa023629d>] [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod] RSP: 0018:ffff8800280a1f08 EFLAGS: 00010282 RAX: ffffffffa02544e0 RBX: ffff8802aa1111d0 RCX: ffff8802aa1111e0 RDX: ffff8802ab913e70 RSI: 0000000000000000 RDI: ffff8802ab913e70 RBP: ffff8800280a1f28 R08: ffffc90005457040 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffb R13: ffff8802ab913e88 R14: ffff8802ab9c1438 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff88002809e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003d54a98640 CR3: 000000029f0a1000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process ksoftirqd/1 (pid: 7, threadinfo ffff8802ae50e000, task ffff8802ae4f8040) Stack: ffff8800280a1f38 0000000000000020 ffffffff814f30a0 0000000000000004 <0> ffff8800280a1f58 ffffffff8116b245 ffff8800280a1f38 ffff8800280a1f38 <0> ffff8800280a1f58 0000000000000001 ffff8800280a1fa8 ffffffff810477bc Call Trace: <IRQ> [<ffffffff8116b245>] blk_done_softirq+0x75/0x90 [<ffffffff810477bc>] __do_softirq+0xcc/0x210 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110 [<ffffffff8100ce7c>] call_softirq+0x1c/0x50 <EOI> [<ffffffff8100e785>] do_softirq+0x65/0xa0 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110 [<ffffffff810471e0>] ksoftirqd+0x70/0x110 [<ffffffff81059559>] kthread+0x99/0xb0 [<ffffffff8100cd7a>] child_rip+0xa/0x20 [<ffffffff8100c73c>] ? restore_args+0x0/0x30 [<ffffffff810594c0>] ? kthread+0x0/0xb0 [<ffffffff8100cd70>] ? child_rip+0x0/0x20 Code: 44 89 e6 48 89 df e8 23 fb f2 e0 be 01 00 00 00 4c 89 f7 e8 f6 fd ff ff 5b 41 5c 41 5d 41 5e c9 c3 4c 89 ef e8 85 fe ff ff eb ed <0f> 0b eb fe 41 8b 85 dc 00 00 00 48 83 bb 10 01 00 00 00 89 83 RIP [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod] RSP <ffff8800280a1f08> ---[ end trace 16af0a1d8542da55 ]--- Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-08-30md/raid456: distribute raid processing over multiple coresDan Williams
Now that the resources to handle stripe_head operations are allocated percpu it is possible for raid5d to distribute stripe handling over multiple cores. This conversion also adds a call to cond_resched() in the non-multicore case to prevent one core from getting monopolized for raid operations. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: remove synchronous infrastructureYuri Tikhonov
These routines have been replaced by there asynchronous counterparts. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: asynchronous handle_stripe6Yuri Tikhonov
1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to raid_run_ops 2/ Implement a handler for sh->reconstruct_state similar to the raid5 case (adds handling of Q parity) 3/ Prevent handle_parity_checks6 from running concurrently with 'compute' operations 4/ Hook up raid_run_ops Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: asynchronous handle_parity_check6Dan Williams
[ Based on an original patch by Yuri Tikhonov ] Implement the state machine for handling the RAID-6 parities check and repair functionality. Note that the raid6 case does not need to check for new failures, like raid5, as it will always writeback the correct disks. The raid5 case can be updated to check zero_sum_result to avoid getting confused by new failures rather than retrying the entire check operation. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: asynchronous handle_stripe_dirtying6Yuri Tikhonov
In the synchronous implementation of stripe dirtying we processed a degraded stripe with one call to handle_stripe_dirtying6(). I.e. compute the missing blocks from the other drives, then copy in the new data and reconstruct the parities. In the asynchronous case we do not perform stripe operations directly. Instead, operations are scheduled with flags to be later serviced by raid_run_ops. So, for the degraded case the final reconstruction step can only be carried out after all blocks have been brought up to date by being read, or computed. Like the raid5 case schedule_reconstruction() sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and through operation chaining can handle compute and reconstruct in a single raid_run_ops pass. [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: asynchronous handle_stripe_fill6Yuri Tikhonov
Modify handle_stripe_fill6 to work asynchronously by introducing fetch_block6 as the raid6 analog of fetch_block5 (schedule compute operations for missing/out-of-sync disks). [dan.j.williams@intel.com: compute D+Q in one pass] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid5,6: common schedule_reconstruction for raid5/6Yuri Tikhonov
Extend schedule_reconstruction5 for reuse by the raid6 path. Add support for generating Q and BUG() if a request is made to perform 'prexor'. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: asynchronous raid6 operationsDan Williams
[ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid5: factor out mark_uptodate from ops_complete_compute5Dan Williams
ops_complete_compute5 can be reused in the raid6 path if it is updated to generically handle a second target. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30async_tx: raid6 recovery self testDan Williams
Port drivers/md/raid6test/test.c to use the async raid6 recovery routines. This is meant as a unit test for raid6 acceleration drivers. In addition to the 16-drive test case this implements tests for the 4-disk and 5-disk special cases (dma devices can not generically handle less than 2 sources), and adds a test for the D+Q case. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30async_tx: add sum check flagsDan Williams
Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid5,6: add percpu scribble region for buffer listsDan Williams
Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-30md/raid6: move the spare page to a percpu allocationDan Williams
In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-22[SCSI] scsi_dh: Use scsi_dh_set_params() in multipath.Chandra Seetharaman
Use scsi_dh_set_params() set parameters provided. Save the parameters in parse_hw_handler() and use it in parse_path(). Reported-by: Eddie Williams <Eddie.Williams@steeleye.com> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Tested-by: Eddie Williams <Eddie.Williams@steeleye.com> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2009-08-18Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds
* 'for-linus' of git://neil.brown.name/md: Fix new incorrect error return from do_md_stop.
2009-08-18Fix new incorrect error return from do_md_stop.NeilBrown
Recent commit c8c00a6915a2e3d10416e8bdd3138429beb96210 changed the exit paths in do_md_stop and was not quite careful enough. There is one path were 'err' now needs to be cleared but it isn't. So setting an array to readonly (with mdadm --readonly) will work, but will incorrectly report and error: ENXIO. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-16dm-log-userspace: fix printk format warningRandy Dunlap
drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t' Previously posted and acked, but apparently lost. http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: dm-devel@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-13md: allow upper limit for resync/reshape to be set when array is read-onlyNeilBrown
Normally we only allow the upper limit for a reshape to be decreased when the array not performing a sync/recovery/reshape, otherwise there could be races. But if an array is part-way through a reshape when it is assembled the reshape is started immediately leaving no window to set an upper bound. If the array is started read-only, the reshape will be suspended until the array becomes writable, so that provides a window during which it is perfectly safe to reduce the upper limit of a reshape. So: allow the upper limit (sync_max) to be reduced even if the reshape thread is running, as long as the array is still read-only. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13md/raid5: Properly remove excess drives after shrinking a raid5/6NeilBrown
We were removing the drives, from the array, but not removing symlinks from /sys/.... and not marking the device as having been removed. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13md/raid5: make sure a reshape restarts at the correct address.NeilBrown
This "if" don't allow for the possibility that the number of devices doesn't change, and so sector_nr isn't set correctly in that case. So change '>' to '>='. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13md/raid5: allow new reshape modes to be restarted in the middle.NeilBrown
md/raid5 doesn't allow a reshape to restart if it involves writing over the same part of disk that it would be reading from. This happens at the beginning of a reshape that increases the number of devices, at the end of a reshape that decreases the number of devices, and continuously for a reshape that does not change the number of devices. The current code is correct for the "increase number of devices" case as the critical section at the start is handled by userspace performing a backup. It does not work for reducing the number of devices, or the no-change case. For 'reducing', we need to invert the test. For no-change we cannot really be sure things will be safe, so simply require the array to be read-only, which is how the user-space code which carefully starts such arrays works. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-12md: never advance 'events' counter by more than 1.NeilBrown
When assembling arrays, md allows two devices to have different event counts as long as the difference is only '1'. This is to cope with a system failure between updating the metadata on two difference devices. However there are currently times when we update the event count by 2. This was done to keep the event count even when the array is clean and odd when it is dirty, which allows us to avoid writing common update to spare devices and so allow those spares to go to sleep. This is bad for the above reason. So change it to never increase by two. This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, but that is only a small cost. The spares will get a few more updates but that will still be spared (;-) most updates and can still go to sleep. Prior to this patch there was a small chance that after a crash an array would fail to assemble due to the overly large event count mismatch. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-10Remove deadlock potential in md_openNeilBrown
A recent commit: commit 449aad3e25358812c43afc60918c5ad3819488e7 introduced the possibility of an A-B/B-A deadlock between bd_mutex and reconfig_mutex. __blkdev_get holds bd_mutex while calling md_open which takes reconfig_mutex, do_md_run is always called with reconfig_mutex held, and it now takes bd_mutex in the call the revalidate_disk. This potential deadlock was not caught by lockdep due to the use of mutex_lock_interruptible_nexted which was introduced by commit d63a5a74dee87883fda6b7d170244acaac5b05e8 do avoid a warning of an impossible deadlock. It is quite possible to split reconfig_mutex in to two locks. One protects the array data structures while it is being reconfigured, the other ensures that an array is never even partially open while it is being deactivated. In particular, the second lock prevents an open from completing between the time when do_md_stop checks if there are any active opens, and the time when the array is either set read-only, or when ->pers is set to NULL. So we can be certain that no IO is in flight as the array is being destroyed. So create a new lock, open_mutex, just to ensure exclusion between 'open' and 'stop'. This avoids the deadlock and also avoids the lockdep warning mentioned in commit d63a5a74d Reported-by: "Mike Snitzer" <snitzer@gmail.com> Reported-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: Use revalidate_disk to effect changes in size of device.NeilBrown
As revalidate_disk calls check_disk_size_change, it will cause any capacity change of a gendisk to be propagated to the blockdev inode. So use that instead of mucking about with locks and i_size_write. Also add a call to revalidate_disk in do_md_run and a few other places where the gendisk capacity is changed. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: allow raid5_quiesce to work properly when reshape is happening.NeilBrown
The ->quiesce method is not supposed to stop resync/recovery/reshape, just normal IO. But in raid5 we don't have a way to know which stripes are being used for normal IO and which for resync etc, so we need to wait for all stripes to be idle to be sure that all writes have completed. However reshape keeps at least some stripe busy for an extended period of time, so a call to raid5_quiesce can block for several seconds needlessly. So arrange for reshape etc to pause briefly while raid5_quiesce is trying to quiesce the array so that the active_stripes count can drop to zero. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md/raid5: set reshape_position correctly when reshape starts.NeilBrown
As the internal reshape_progress counter is the main driver for reshape, the fact that reshape_position sometimes starts with the wrong value has minimal effect. It is visible in sysfs and that is all. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: Handle growth of v1.x metadata correctly.NeilBrown
The v1.x metadata does not have a fixed size and can grow when devices are added. If it grows enough to require an extra sector of storage, we need to update the 'sb_size' to match. Without this, md can write out an incomplete superblock with a bad checksum, which will be rejected when trying to re-assemble the array. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: avoid array overflow with bad v1.x metadataNeilBrown
We trust the 'desc_nr' field in v1.x metadata enough to use it as an index in an array. This isn't really safe. So range-check the value first. Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: when a level change reduces the number of devices, remove the excess.NeilBrown
When an array is changed from RAID6 to RAID5, fewer drives are needed. So any device that is made superfluous by the level conversion must be marked as not-active. For the RAID6->RAID5 conversion, this will be a drive which only has 'Q' blocks on it. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03md: Push down data integrity code to personalities.Andre Noll
This patch replaces md_integrity_check() by two new public functions: md_integrity_register() and md_integrity_add_rdev() which are both personality-independent. md_integrity_register() is called from the ->run and ->hot_remove methods of all personalities that support data integrity. The function iterates over the component devices of the array and determines if all active devices are integrity capable and if their profiles match. If this is the case, the common profile is registered for the mddev via blk_integrity_register(). The second new function, md_integrity_add_rdev() is called from the ->hot_add_disk methods, i.e. whenever a new device is being added to a raid array. If the new device does not support data integrity, or has a profile different from the one already registered, data integrity for the mddev is disabled. For raid0 and linear, only the call to md_integrity_register() from the ->run method is necessary. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-31md/raid6: release spare page at ->stop()Dan Williams
Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-23dm table: pass correct dev area size to device_area_is_validMike Snitzer
Incorrect device area lengths are being passed to device_area_is_valid(). The regression appeared in 2.6.31-rc1 through commit 754c5fc7ebb417b23601a6222a6005cc2e7f2913. With the dm-stripe target, the size of the target (ti->len) was used instead of the stripe_width (ti->len/#stripes). An example of a consequent incorrect error message is: device-mapper: table: 254:0: sdb too small for target Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23dm: remove queue next_ordered workaround for barriersMike Snitzer
This patch removes DM's bio-based vs request-based conditional setting of next_ordered. For bio-based DM the next_ordered check is no longer a concern (as that check is now in the __make_request path). For request-based DM the default of QUEUE_ORDERED_NONE is now appropriate. bio-based DM was changed to work-around the previously misplaced next_ordered check with this commit: 99360b4c18f7675b50d283301d46d755affe75fd request-based DM does not yet support barriers but reacted to the above bio-based DM change with this commit: 5d67aa2366ccb8257d103d0b43df855605c3c086 The above changes are no longer needed given Neil Brown's recent fix to put the next_ordered check in the __make_request path: db64f680ba4b5c56c4be59f0698000df89ff0281 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: NeilBrown <neilb@suse.de> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>