summaryrefslogtreecommitdiff
path: root/drivers/md/raid5.c
AgeCommit message (Collapse)Author
2011-08-31md/raid5: fix a hang on device failure.NeilBrown
Waiting for a 'blocked' rdev to become unblocked in the raid5d thread cannot work with internal metadata as it is the raid5d thread which will clear the blocked flag. This wasn't a problem in 3.0 and earlier as we only set the blocked flag when external metadata was used then. However we now set it always, so we need to be more careful. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid5: Clear bad blocks on successful write.NeilBrown
On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid5. Don't write to known bad block on doubtful devices.NeilBrown
If a device has seen write errors, don't write to any known bad blocks on that device. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid5: write errors should be recorded as bad blocks if possible.NeilBrown
When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid5: use bad-block log to improve handling of uncorrectable read errors.NeilBrown
If we get an uncorrectable read error - record a bad block rather than failing the device. And if these errors (which may be due to known bad blocks) cause recovery to be impossible, record a bad block on the recovering devices, or abort the recovery. As we might abort a recovery without failing a device we need to teach RAID5 about recovery_disabled handling. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md/raid5: avoid reading from known bad blocks.NeilBrown
There are two times that we might read in raid5: 1/ when a read request fits within a chunk on a single working device. In this case, if there is any bad block in the range of the read, we simply fail the cache-bypass read and perform the read though the stripe cache. 2/ when reading into the stripe cache. In this case we mark as failed any device which has a bad block in that strip (1 page wide). Note that we will both avoid reading and avoid writing. This is correct (as we will never read from the block, there is no point writing), but not optimal (as writing could 'fix' the error) - that will be addressed later. If we have not seen any write errors on the device yet, we treat a bad block like a recent read error. This will encourage an attempt to fix the read error which will either generate a write error, or will ensure good data is stored there. We don't yet forget the bad block in that case. That comes later. Now that we honour bad blocks when reading we can allow devices with bad blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md: make it easier to wait for bad blocks to be acknowledged.NeilBrown
It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty *or* it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28md: don't allow arrays to contain devices with bad blocks.NeilBrown
As no personality understand bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: Avoid BUG caused by multiple failures.NeilBrown
While preparing to write a stripe we keep the parity block or blocks locked (R5_LOCKED) - towards the end of schedule_reconstruction. If the array is discovered to have failed before this write completes we can leave those blocks LOCKED, and init_stripe will notice that a free stripe still has a locked block and will complain. So clear the R5_LOCKED flag in handle_failed_stripe, and demote the 'BUG' to a 'WARN_ON'. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md/raid5: move rdev->corrected_errors countingNamhyung Kim
Read errors are considered to corrected if write-back and re-read cycle is finished without further problems. Thus moving the rdev-> corrected_errors counting after the re-reading looks more reasonable IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md: introduce link/unlink_rdev() helpersNamhyung Kim
There are places where sysfs links to rdev are handled in a same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md/raid: use printk_ratelimited instead of printk_ratelimitChristian Dietrich
As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27md/raid5: finalise new merged handle_stripe.NeilBrown
handle_stripe5() and handle_stripe6() are now virtually identical. So discard one and rename the other to 'analyse_stripe()'. It always returns 0, so change it to 'void' and remove the 'done' variable in handle_stripe(). Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: move some more common code into handle_stripeNeilBrown
The RAID6 version of this code is usable for RAID5 providing: - we test "conf->max_degraded" rather than "2" as appropriate - we make sure s->failed_num[1] is meaningful (and not '-1') when s->failed > 1 The 'return 1' must become 'goto finish' in the new location. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: move more common code into handle_stripeNeilBrown
Apart from 'prexor' which can only be set for RAID5, and 'qd_idx' which can only be meaningful for RAID6, these two chunks of code are nearly the same. So combine them into one adding a test to call either handle_parity_checks5 or handle_parity_checks6 as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6NeilBrown
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is also allow 'read-modify-write' Apart from this difference, handle_stripe_dirtying[56] are nearly identical. So resolve these differences and create just one function. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: unite fetch_block5 and fetch_block6NeilBrown
Provided that ->failed_num[1] is not a valid device number (which is easily achieved) fetch_block6 provides all the functionality of fetch_block5. So remove the latter and rename the former to simply "fetch_block". Then handle_stripe_fill5 and handle_stripe_fill6 become the same and can similarly be united. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: rearrange a test in fetch_block6.NeilBrown
Next patch will unite fetch_block5 and fetch_block6. First I want to make the differences a little more clear. For RAID6 if we are writing at all and there is a failed device, then we need to load or compute every block so we can do a reconstruct-write. This case isn't needed for RAID5 - we will do a read-modify-write in that case. So make that test a separate test in fetch_block6 rather than merged with two other tests. Make a similar change in fetch_block5 so the one bit that is not needed for RAID6 is clearly separate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: move more code into common handle_stripeNeilBrown
The difference between the RAID5 and RAID6 code here is easily resolved using conf->max_degraded. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: Move code for finishing a reconstruction into handle_stripe.NeilBrown
Prior to commit ab69ae12ceef7 the code in handle_stripe5 and handle_stripe6 to "Finish reconstruct operations initiated by the expansion process" was identical. That commit added an identical stanza of code to each function, but in different places. That was careless. The raid5 code was correct, so move that out into handle_stripe and remove raid6 version. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.NeilBrown
This arg is only used to differentiate between RAID5 and RAID6 but that is not needed. For RAID5, raid5_compute_sector will set qd_idx to "~0" so j with certainly not equals qd_idx, so there is no need for a guard on that condition. So remove the guard and remove the arg from the declaration and callers of handle_stripe_expansion. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: move stripe_head_state and more code into handle_stripe.NeilBrown
By defining the 'stripe_head_state' in 'handle_stripe', we can move some common code out of handle_stripe[56]() and into handle_stripe. The means that all accesses for stripe_head_state in handle_stripe[56] need to be 's->' instead of 's.', but the compiler should inline those functions and just use a direct stack reference, and future patches while hoist most of this code up into handle_stripe() so we will revert to "s.". Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: add some more fields to stripe_head_stateNeilBrown
Adding these three fields will allow more common code to be moved to handle_stripe() struct field rearrangement by Namhyung Kim. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: unify stripe_head_state and r6_stateNeilBrown
'struct stripe_head_state' stores state about the 'current' stripe that is passed around while handling the stripe. For RAID6 there is an extension structure: r6_state, which is also passed around. There is no value in keeping these separate, so move the fields from the latter into the former. This means that all code now needs to treat s->failed_num as an small array, but this is a small cost. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: move common code into handle_stripeNeilBrown
There is common code at the start of handle_stripe5 and handle_stripe6. Move it into handle_stripe. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: replace sh->lock with an 'active' flag.NeilBrown
sh->lock is now mainly used to ensure that two threads aren't running in the locked part of handle_stripe[56] at the same time. That can more neatly be achieved with an 'active' flag which we set while running handle_stripe. If we find the flag is set, we simply requeue the stripe for later by setting STRIPE_HANDLE. For safety we take ->device_lock while examining the state of the stripe and creating a summary in 'stripe_head_state / r6_state'. This possibly isn't needed but as shared fields like ->toread, ->towrite are checked it is safer for now at least. We leave the label after the old 'unlock' called "unlock" because it will disappear in a few patches, so renaming seems pointless. This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE later, but that is not a problem. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: Protect some more code with ->device_lock.NeilBrown
Other places that change or follow dev->towrite and dev->written take the device_lock as well as the sh->lock. So it should really be held in these places too. Also, doing so will allow sh->lock to be discarded. with merged fixes by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26md/raid5: Remove use of sh->lock in sync_requestNeilBrown
This is the start of a series of patches to remove sh->lock. sync_request takes sh->lock before setting STRIPE_SYNCING to ensure there is no race with testing it in handle_stripe[56]. Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early in handle_stripe[56] (after getting the same lock) and perform the same set/clear operations if it was set. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-18md/raid5: get rid of duplicated call to bio_data_dir()Namhyung Kim
In raid5::make_request(), once bio_data_dir(@bi) is detected it never (and couldn't) be changed. Use the result always. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18md/raid5: use kmem_cache_zalloc()Namhyung Kim
Replace kmem_cache_alloc + memset(,0,) to kmem_cache_zalloc. I think it's not harmful since @conf->slab_cache already knows actual size of struct stripe_head. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-14md/raid5: remove unusual use of bio_iovec_idx()Namhyung Kim
In the bio_for_each_segment loop, bvl always points current bio_vec, so the same as bio_iovec_idx(, i). Let's get rid of it. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-14md/raid5: fix FUA request handling in ops_run_io()Namhyung Kim
Commit e9c7469bb4f5 ("md: implment REQ_FLUSH/FUA support") introduced R5_WantFUA flag and set rw to WRITE_FUA in that case. However remaining code still checks whether rw is exactly same as WRITE or not, so FUAed-write ends up with being treated as READ. Fix it. This bug has been present since 2.6.37 and the fix is suitable for any -stable kernel since then. It is not clear why this has not caused more problems. Cc: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-14md/raid5: fix raid5_set_bi_hw_segmentsNamhyung Kim
The @bio->bi_phys_segments consists of active stripes count in the lower 16 bits and processed stripes count in the upper 16 bits. So logical-OR operator should be bitwise one. This bug has been present since 2.6.27 and the fix is suitable for any -stable kernel since then. Fortunately the bad code is only used on error paths and is relatively unlikely to be hit. Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-06-09MD: raid5 do not set fullsyncJonathan Brassow
Add check to determine if a device needs full resync or if partial resync will do RAID 5 was assuming that if a device was not In_sync, it must undergo a full resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'. If it is, we can safely skip the full resync and rely on the bitmap for partial recovery instead. This is the legitimate purpose of 'saved_raid_disk', from md.h: int saved_raid_disk; /* role that device used to have in the * array and could again if we did a partial * resync from the bitmap */ Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) b43: fix comment typo reqest -> request Haavard Skinnemoen has left Atmel cris: typo in mach-fs Makefile Kconfig: fix copy/paste-ism for dell-wmi-aio driver doc: timers-howto: fix a typo ("unsgined") perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). treewide: fix a few typos in comments regulator: change debug statement be consistent with the style of the rest Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations" audit: acquire creds selectively to reduce atomic op overhead rtlwifi: don't touch with treewide double semicolon removal treewide: cleanup continuations and remove logging message whitespace ath9k_hw: don't touch with treewide double semicolon removal include/linux/leds-regulator.h: fix syntax in example code tty: fix typo in descripton of tty_termios_encode_baud_rate xtensa: remove obsolete BKL kernel option from defconfig m68k: fix comment typo 'occcured' arch:Kconfig.locks Remove unused config option. treewide: remove extra semicolons ...
2011-05-11md: allow resync_start to be set while an array is active.NeilBrown
The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-11md: make error_handler functions more uniform and correct.NeilBrown
- there is no need to test_bit Faulty, as that was already done in md_error which is the only caller of these functions. - MD_CHANGE_DEVS should be set *after* faulty is set to ensure metadata is updated correctly. - spinlock should be held while updating ->degraded. Signed-off-by: NeilBrown <neilb@suse.de>
2011-05-10md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').Jesper Juhl
There's a small typo in a comment in drivers/md/raid5.c - 'Of course' is misspelled as 'Ofcourse'. This patch fixes the spelling error. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-04-21raid5: fix build error, sector_t usageRandy Dunlap
Change <sectors> from unsigned long long to sector_t. This matches its source field. ERROR: "__udivdi3" [drivers/md/raid456.ko] undefined! Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-20md: Fix dev_sectors on takeover from raid0 to raid4/5NeilBrown
A raid0 array doesn't set 'dev_sectors' as each device might contribute a different number of sectors. So when converting to a RAID4 or RAID5 we need to set dev_sectors as they need the number. We have already verified that in fact all devices do contribute the same number of sectors, so use that number. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-20md/raid5: remove setting of ->queue_lockNeilBrown
We previously needed to set ->queue_lock to match the raid5 device_lock so we could safely use queue_flag_* operations (e.g. for plugging). which test the ->queue_lock is in fact locked. However that need has completely gone away and is unlikely to come back to remove this now-pointless setting. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md: incorporate new plugging into raid5.NeilBrown
In raid5 plugging is used for 2 things: 1/ collecting writes that require a bitmap update 2/ collecting writes in the hope that we can create full stripes - or at least more-full. We now release these different sets of stripes when plug_cnt is zero. Also in make_request, we call mddev_check_plug to hopefully increase plug_cnt, and wake up the thread at the end if plugging wasn't achieved for some reason. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md - remove old plugging code.NeilBrown
md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md: use new plugging interface for RAID IO.NeilBrown
md/raid submits a lot of IO from the various raid threads. So adding start/finish plug calls to those so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>
2011-03-10block: remove per-queue pluggingJens Axboe
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-31md: don't abort checking spares as soon as one cannot be added.NeilBrown
As spares can be added manually before a reshape starts, we need to find them all to mark some of them as in_sync. Previously we would abort looking for spares when we found an unallocated spare what could not be added to the array (implying there was no room for new spares). However already-added spares could be later in the list, so we need to keep searching. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: fix the test for finding spares in raid5_start_reshape.NeilBrown
As spares can be added to the array before the reshape is started, we need to find and count them when checking there are enough. The array could have been degraded, so we need to check all devices, no just those out side of the range of devices in the array before the reshape. So instead of checking the index, check the In_sync flag as that reliably tells if the device is a spare or this purpose. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: simplify some 'if' conditionals in raid5_start_reshape.NeilBrown
There are two consecutive 'if' statements. if (mddev->delta_disks >= 0) .... if (mddev->delta_disks > 0) The code in the second is equally valid if delta_disks == 0, and these two statements are the only place that 'added_devices' is used. So make them a single if statement, make added_devices a local variable, and re-indent it all. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-13md/raid5: handle manually-added spares in start_reshape.NeilBrown
It is possible to manually add spares to specific slots before starting a reshape. raid5_start_reshape should recognised this possibility and include it in the accounting. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-13md: Don't let implementation detail of curr_resync leak out through sysfs.NeilBrown
mddev->curr_resync has artificial values of '1' and '2' which are used by the code which ensures only one resync is happening at a time on any given device. These values are internal and should never be exposed to user-space (except when translated appropriately as in the 'pending' status in /proc/mdstat). Unfortunately they are as ->curr_resync is assigned to ->curr_resync_completed and that value is directly visible through sysfs. So change the assignments to ->curr_resync_completed to get the same valued from elsewhere in a form that doesn't have the magic '1' or '2' values. Signed-off-by: NeilBrown <neilb@suse.de>