path: root/drivers/md/dm-cache-target.c
Age    Commit message    Author
2015-07-29dm cache: fix device destroy hang due to improper prealloc_used accountingMike Snitzer
Commit 665022d72f9 ("dm cache: avoid calls to prealloc_free_structs() if possible") introduced a regression that caused the removal of a DM cache device to hang in cache_postsuspend()'s call to wait_for_migrations() with the following stack trace: [<ffffffff81651457>] schedule+0x37/0x80 [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache] [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0 [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod] [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod] [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod] [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod] [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod] [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod] [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10 [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0 [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100 [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff811fd689>] SyS_ioctl+0x79/0x90 [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110 [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by accounting for the call to prealloc_data_structs() immediately _before_ the call as opposed to after. This is needed because it is possible to break out of the control loop after the call to prealloc_data_structs() but before prealloc_used was set to true. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
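Illustrative sketch of the accounting pattern described above (simplified stand-in code, not the actual dm-cache worker): the prealloc_used flag has to be set immediately before prealloc_data_structs() is attempted, because the loop can terminate between that call and any later assignment.

  #include <stdbool.h>

  /* Simplified stand-ins; the real dm-cache structures and helpers differ. */
  struct prealloc { int dummy; };
  static int prealloc_data_structs(struct prealloc *p) { (void)p; return 0; }
  static void prealloc_free_structs(struct prealloc *p) { (void)p; }
  static bool more_work(int i) { return i < 4; }

  static void worker_loop(void)
  {
      struct prealloc structs = { 0 };
      bool prealloc_used = false;
      int i;

      for (i = 0; more_work(i); i++) {
          /*
           * Account *before* the call: even if we break out right after
           * prealloc_data_structs(), the cleanup below still knows the
           * structures may need freeing.
           */
          prealloc_used = true;
          if (prealloc_data_structs(&structs))
              break;
          /* ... issue migrations using the preallocated structures ... */
      }

      if (prealloc_used)
          prealloc_free_structs(&structs);
  }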
2015-07-29Revert "dm cache: do not wake_worker() in free_migration()"Mike Snitzer
This reverts commit 386cb7cdeeef97e0bf082a8d6bbfc07a2ccce07b. Taking the wake_worker() out of free_migration() will slow writeback dramatically, and hence adaptability. Say we have 10k blocks that need writing back, but are only able to issue 5 concurrently due to the migration bandwidth: it's imperative that we wake_worker() immediately after migration completion; waiting for the next 1 second wake up (via do_waker) means it'll take a long time to write that all back. Reported-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-17dm cache: avoid calls to prealloc_free_structs() if possibleMike Snitzer
If no work was performed then prealloc_data_structs() wasn't ever called so there isn't any need to call prealloc_free_structs(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-17dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()Mike Snitzer
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs() if the policy doesn't have any dirty blocks ready for writeback. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-17dm cache: do not wake_worker() in free_migration()Mike Snitzer
All methods that queue work call wake_worker() as you'd expect. E.g. cell_defer, defer_bio, quiesce_migration (which is called by writeback, promote, demote_then_promote, invalidate, discard, etc). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm cache: display 'needs_check' in status if it is setMike Snitzer
There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the cache status if it is set in the cache metadata. Also, update cache documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm cache: age and write back cache entries even without active IOJoe Thornber
The policy tick() method is normally called from interrupt context. Both the mq and smq policies do some bottom half work for the tick method in their map functions. However if no IO is going through the cache, then that bottom half work doesn't occur. With these policies this means recently hit entries do not age and do not get written back as early as we'd like. Fix this by introducing a new 'can_block' parameter to the tick() method. When this is set the bottom half work occurs immediately. 'can_block' is set when the tick method is called every second by the core target (not in interrupt context). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
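A rough sketch of the interface change described above (field and function names are illustrative, not the exact dm_cache_policy definition): the tick() callback gains a flag telling the policy whether it may do its deferred bottom-half work right away.

  #include <stdbool.h>

  /* Illustrative policy vtable; the real struct dm_cache_policy differs. */
  struct policy {
      void (*tick)(struct policy *p, bool can_block);
  };

  static void age_entries_and_queue_writeback(struct policy *p) { (void)p; }

  static void example_tick(struct policy *p, bool can_block)
  {
      /*
       * can_block is true when the core target calls tick() from its
       * once-per-second timer (process context); false when called
       * from interrupt context on the IO path.
       */
      if (can_block)
          age_entries_and_queue_writeback(p);
      /* else: leave the work for the next map() call, as before */
  }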
2015-06-11dm cache: prefix all DMERR and DMINFO messages with cache device nameMike Snitzer
Having the DM device name associated with the ERR or INFO message is very helpful. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm cache: add fail io mode and needs_check flagJoe Thornber
If a cache metadata operation fails (e.g. transaction commit) the cache's metadata device will abort the current transaction, set a new needs_check flag, and the cache will transition to "read-only" mode. If aborting the transaction or setting the needs_check flag fails the cache will transition to "fail-io" mode. Once needs_check is set the cache device will not be allowed to activate. Activation requires write access to metadata. Future work is needed to add proper support for running the cache in read-only mode. Once in fail-io mode the cache will report a status of "Fail". Also, add commit() wrapper that will disallow commits if in read_only or fail mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
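A minimal sketch of the commit() wrapper idea, assuming a simple mode enum (names are illustrative, not the actual dm-cache code): once the cache is read-only or failed, metadata commits are refused.

  #include <errno.h>

  enum cache_mode { CM_WRITE, CM_READ_ONLY, CM_FAIL };

  struct cache { enum cache_mode mode; };

  static int metadata_commit(struct cache *c) { (void)c; return 0; }

  /* Refuse commits unless the cache is fully writable. */
  static int commit(struct cache *cache)
  {
      if (cache->mode != CM_WRITE)
          return -EINVAL;
      return metadata_commit(cache);
  }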
2015-06-11dm cache: wake the worker thread every time we free a migration objectJoe Thornber
When the cache is idle, writeback work was only being issued every second. With this change outstanding writebacks are streamed constantly. This offers a writeback performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: boost promotion of blocks that will be overwrittenJoe Thornber
When considering whether to move a block to the cache we already give preferential treatment to discarded blocks, since they are cheap to promote (no read of the origin required since the data is junk). The same is true of blocks that are about to be completely overwritten, so we likewise boost their promotion chances. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: defer whole cellsJoe Thornber
Currently individual bios are deferred to the worker thread if they cannot be processed immediately (e.g. a block is in the process of being moved to the fast device). This patch passes whole cells across to the worker. This saves reacquiring the cell, and also collects bios destined for the same block together, which allows them to be mapped with a single lookup to the policy. This reduces the overhead of using dm-cache. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: pull out some bitset utility functions for reuseJoe Thornber
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: pass a new 'critical' flag to the policies when requesting writeback workJoe Thornber
We only allow non critical writeback if the origin is idle. It is up to the policy to decide what writeback work is critical. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: track IO to the origin device using io_trackerJoe Thornber
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29dm cache: add io_trackerJoe Thornber
A little class that keeps track of the volume of io that is in flight, and the length of time that a device has been idle for. FIXME: rather than jiffies, it may be best to use ktime_t (to support faster devices). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
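A user-space model of the io_tracker idea (purely illustrative; the kernel version tracks jiffies under a spinlock, here replaced by C11 atomics and time()): count in-flight IO and remember when the device last went idle.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <time.h>

  struct io_tracker {
      atomic_uint in_flight;   /* IOs issued but not yet completed */
      atomic_long idle_since;  /* when in_flight last dropped to zero */
  };

  static void iot_io_begin(struct io_tracker *t)
  {
      atomic_fetch_add(&t->in_flight, 1);
  }

  static void iot_io_end(struct io_tracker *t)
  {
      if (atomic_fetch_sub(&t->in_flight, 1) == 1)
          atomic_store(&t->idle_since, (long)time(NULL));
  }

  /* Has the device been idle for at least 'secs' seconds? */
  static bool iot_idle_for(struct io_tracker *t, long secs)
  {
      if (atomic_load(&t->in_flight) != 0)
          return false;
      return (long)time(NULL) - atomic_load(&t->idle_since) >= secs;
  }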
2015-05-29dm cache: fix race when issuing a POLICY_REPLACE operationJoe Thornber
There is a race between a policy deciding to replace a cache entry, the core target writing back any dirty data from this block, and other IO threads doing IO to the same block. This sort of problem is avoided most of the time by the core target grabbing a bio prison cell before making the request to the policy. But for a demotion the core target doesn't know which block will be demoted, so can't do this in advance. Fix this demotion race by introducing a callback to the policy interface that allows the policy to grab the cell on behalf of the core target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-05-22block: remove management of bi_remaining when restoring original bi_end_ioMike Snitzer
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
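The four-step hook/restore pattern described above, modelled with a simplified stand-in type (not the real struct bio): after the fix, callers of this pattern never have to touch a remaining-count themselves; they just expect the end_io they restored to run.

  /* Simplified stand-in for struct bio; the real one lives in the kernel. */
  struct fake_bio {
      void (*end_io)(struct fake_bio *bio);
      void (*saved_end_io)(struct fake_bio *bio);
  };

  static void intermediate_end_io(struct fake_bio *bio)
  {
      /* ... per-IO bookkeeping goes here ... */
      bio->end_io = bio->saved_end_io;   /* step 3: restore the original */
      bio->end_io(bio);                  /* step 4: complete again */
  }

  static void hook_bio(struct fake_bio *bio)
  {
      bio->saved_end_io = bio->end_io;   /* step 1: save the original */
      bio->end_io = intermediate_end_io; /* step 2: wire up the intermediate */
  }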
2015-05-05bio: skip atomic inc/dec of ->bi_remaining for non-chainsJens Axboe
Struct bio has an atomic ref count for chained bios, and we use this to know when to end IO on the bio. However, most bios are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-13Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds
Pull device mapper changes from Mike Snitzer: - The most significant change this cycle is that request-based DM now supports stacking on top of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. An early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues dm snapshot: remove unnecessary NULL checks before vfree() calls dm mpath: simplify failure path of dm_multipath_init() dm thin metadata: remove unused dm_pool_get_data_block_size() dm ioctl: fix stale comment above dm_get_inactive_table() dm crypt: update url in CONFIG_DM_CRYPT help text dm bufio: fix time comparison to use time_after_eq() dm: use time_in_range() and time_after() dm raid: fix a couple integer overflows dm table: train hybrid target type detection to select blk-mq if appropriate dm: allocate requests in target when stacking on blk-mq devices dm: prepare for allocating blk-mq clone requests in target dm: submit stacked requests in irq enabled context dm: split request structure out from dm_rq_target_io structure dm: remove exports for request-based interfaces without external callers
2015-02-09dm: use time_in_range() and time_after()Manuel Schölling
To be future-proof and for better readability the time comparisons are modified to use time_in_range() and time_after() instead of plain, error-prone math. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
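The wrap-safe comparison idea behind the kernel's time_after() and time_in_range() macros, sketched in user-space C with unsigned long standing in for jiffies: signed subtraction keeps the comparison correct across a counter wrap, which a plain '>' does not.

  #include <stdbool.h>

  /* "Is a after b?" - stays correct when the counter wraps. */
  static bool after(unsigned long a, unsigned long b)
  {
      return (long)(b - a) < 0;
  }

  /* "Is a within [b, c]?" */
  static bool in_range(unsigned long a, unsigned long b, unsigned long c)
  {
      return !after(b, a) && !after(a, c);
  }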
2015-01-23dm cache: fix problematic dual use of a single migration count variableJoe Thornber
Introduce a new variable to count the number of allocated migration structures. The existing variable cache->nr_migrations became overloaded. It was used to: i) track the number of migrations in flight for the purposes of quiescing during suspend. ii) estimate the amount of background IO occurring. Recent discard changes meant that REQ_DISCARD bios are processed with a migration. Discards are not background IO so nr_migrations was not incremented. However this could cause quiescing to complete early. (i) is now handled with a new variable cache->nr_allocated_migrations. cache->nr_migrations has been renamed cache->nr_io_migrations. cleanup_migration() is now called free_io_migration(), since it decrements that variable. Also, remove the unused cache->next_migration variable that got replaced with prealloc_structs a while ago. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: fix spurious cell_defer when dealing with partial block at end of deviceJoe Thornber
We never bother caching a partial block that is at the back end of the origin device. No cell ever gets locked, but the calling code was assuming it was and trying to release it. Now the code only releases if the cell has been set to a non NULL value. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: dirty flag was mistakenly being cleared when promoting via overwriteJoe Thornber
If the incoming bio is a WRITE and completely covers a block then we don't bother to do any copying for a promotion operation. Once this is done the cache block and origin block will be different, so we need to set it to 'dirty'. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: only use overwrite optimisation for promotion when in writeback modeJoe Thornber
Overwrite causes the cache block and origin blocks to diverge, which is only allowed in writeback mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-12-01dm cache: discard block size must be a multiple of cache block sizeJoe Thornber
Otherwise the cache blocks may span two discard blocks, which we don't handle when doing the discard lookup. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: fix a harmless race when working out if a block is discardedJoe Thornber
It is more correct to hold the cell before checking the discard state. These flags are only used as hints to the policy so this change will have negligible effect. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: when reloading a discard bitset allow for a different discard block sizeJoe Thornber
The discard block size can change if the origin changes size or if an old DM cache is upgraded from using a discard block size that was equal to cache block size. To fix this an extent of discarded blocks is established for the purpose of translating the old discard block size to the new in-core discard block size and set bits. The old (potentially huge) discard bitset is left ondisk until it is re-written using the new in-core information on the next successful DM cache shutdown. Fixes: 7ae34e777896 ("dm cache: improve discard support") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01dm cache: fix some issues with the new discard range supportJoe Thornber
Commit 7ae34e777 ("dm cache: improve discard support") needed to also: - discontinue having DM core split the discard bios on cache block boundaries - calculate the cache's discard_nr_blocks relative to the determined discard_block_size rather than using oblock_to_dblock() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-13dm cache: emit a warning message if there are a lot of cache blocksJoe Thornber
Loading and saving millions of block mappings takes time. We may as well explain what's going on, and encourage people to use a larger cache block size. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10dm cache: improve discard supportJoe Thornber
Safely allow the discard blocksize to be larger than the cache blocksize by using the bio prison's range locking support. This also improves discard performance considerably because larger discards are issued to the dm-cache device. The discard blocksize was always intended to be greater than the cache blocksize. But until now it wasn't implemented safely. Also, by safely restoring the ability to have discard blocksize larger than cache blocksize we're able to significantly reduce the memory used for the cache's discard bitset. Before, with a small discard blocksize, the discard bitset could get quite large because its size is a function of the discard blocksize and the origin device's size. For example, previously, using a 32KB cache blocksize with a 40TB origin resulted in 1280MB of incore memory use for the discard bitset! Now, the discard blocksize is scaled up accordingly to ensure the discard bitset is capped at 2**14 bits, or 16KB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
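A sketch of the scaling described above (constants and helper names are illustrative, not the exact dm-cache calculation): keep doubling the discard block size until the bitset covering the origin fits in 2^14 bits.

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_DISCARD_BITS (1u << 14)   /* cap the bitset at 16384 bits */

  static uint64_t choose_discard_block(uint64_t origin_sectors,
                                       uint64_t cache_block_sectors)
  {
      uint64_t db = cache_block_sectors;   /* never below the cache block size */

      while (origin_sectors / db > MAX_DISCARD_BITS)
          db *= 2;
      return db;
  }

  int main(void)
  {
      uint64_t origin = 40ull * 1024 * 1024 * 1024 * 2;   /* ~40TB origin in 512-byte sectors */
      uint64_t db = choose_discard_block(origin, 64);     /* 32KB cache blocks */

      printf("discard block: %llu sectors\n", (unsigned long long)db);
      return 0;
  }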
2014-11-10dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size"Joe Thornber
This reverts commit d132cc6d9e92424bb9d4fd35f5bd0e55d583f4be because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10dm cache: revert "remove remainder of distinct discard block size"Joe Thornber
This reverts commit 64ab346a360a4b15c28fb8531918d4a01f4eabd9 because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10dm bio prison: introduce support for locking ranges of blocksJoe Thornber
Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10dm bio prison: switch to using a red black treeJoe Thornber
Previously it was using a fixed sized hash table. There are times when very many concurrent cells are held (such as when processing a very large discard). When this happens the hash table performance becomes very poor. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-09-10dm cache: fix race causing dirty blocks to be marked as cleanAnssi Hannula
When a writeback or a promotion of a block is completed, the cell of that block is removed from the prison, the block is marked as clean, and the clear_dirty() callback of the cache policy is called. Unfortunately, performing those actions in this order allows an incoming new write bio for that block to come in before clearing the dirty status is completed, possibly causing one of two scenarios. Scenario A: thread 1 runs cell_defer() (the cell is removed from the prison and the detained bios are queued); on thread 2 an incoming write bio is remapped to the cache and set_dirty() is called, but the block is already dirty so it does nothing; thread 1 then runs clear_dirty() (the block is marked clean and the policy's clear_dirty() is called). Result: the block is marked clean even though it is actually dirty, so no writeback will occur. Scenario B: thread 1 runs cell_defer() and then clear_dirty() (the block is marked clean); on thread 2 an incoming write bio is remapped to the cache and set_dirty() marks the block dirty and calls the policy's set_dirty(); thread 1 then calls the policy's clear_dirty(). Result: the block is properly marked as dirty, but the policy thinks it is clean and therefore never asks us to write it back. This case is visible in the "dmsetup status" dirty block count (which normally decreases to 0 on a quiet device). Fix these issues by calling clear_dirty() before calling cell_defer(). Incoming bios for that block will then be detained in the cell and released only after clear_dirty() has completed, so the race will not occur. Found by inspecting the code after noticing spurious dirty counts (scenario B). Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
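The fix reduces to an ordering constraint, sketched here with illustrative stand-in types (not the actual dm-cache functions): clear the dirty state while the cell still detains new bios, and only then release the cell.

  struct cell { int held; };
  struct cache_block { int dirty; };

  static void clear_dirty(struct cache_block *b) { b->dirty = 0; /* + policy clear_dirty() */ }
  static void cell_defer(struct cell *c) { c->held = 0; /* releases detained bios */ }

  static void migration_success(struct cache_block *b, struct cell *c)
  {
      /*
       * While the cell is still held no new write can slip in between
       * these two steps and set the dirty flag, so neither race
       * scenario above can occur.
       */
      clear_dirty(b);
      cell_defer(c);
  }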
2014-08-01dm cache: set minimum_io_size to cache's data block sizeMike Snitzer
Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the cache's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update cache_io_hints() to set both minimum_io_size and optimal_io_size to the cache's data block size. This fixes an issue where mkfs.xfs would create more XFS Allocation Groups on cache volumes than on a normal linear LV of comparable size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01dm cache metadata: use dm-space-map-metadata.h defined size limitsMike Snitzer
Commit 7d48935e cleaned up the persistent-data's space-map-metadata limits by elevating them to dm-space-map-metadata.h. Update dm-cache-metadata to use these same limits. The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-08-01dm cache: fail migrations in the do_worker error pathJoe Thornber
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01dm cache: simplify deferred set reference count incrementsJoe Thornber
Factor out inc_and_issue and inc_ds helpers to simplify deferred set reference count increments. Also cleanup cache_map to consistently call cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED. No functional change. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01dm cache: fix race affecting dirty block countAnssi Hannula
nr_dirty is updated without locking, causing it to drift so that it is non-zero (either a small positive integer, or a very large one when an underflow occurs) even when there are no actual dirty blocks. This was due to a race between the workqueue and the map function accessing nr_dirty in parallel without proper protection. People were seeing underruns due to a race on increment/decrement of nr_dirty, see: https://lkml.org/lkml/2014/6/3/648 Fix this by using an atomic_t for nr_dirty. Reported-by: roma1390@gmail.com Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
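A minimal model of the fix, with C11 atomics standing in for the kernel's atomic_t: the worker thread and the map path can then update the dirty count concurrently without drift.

  #include <stdatomic.h>

  struct cache_stats {
      atomic_uint nr_dirty;   /* was a plain counter, updated racily */
  };

  static void inc_nr_dirty(struct cache_stats *s)
  {
      atomic_fetch_add(&s->nr_dirty, 1);   /* set_dirty() path */
  }

  static void dec_nr_dirty(struct cache_stats *s)
  {
      atomic_fetch_sub(&s->nr_dirty, 1);   /* clear_dirty() path */
  }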
2014-05-27dm cache: always split discards on cache block boundariesHeinz Mauelshagen
The DM cache target cannot cope with discards that span multiple cache blocks, so each discard bio that spans more than one cache block must get split by the DM core. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.9+
2014-05-01dm cache: fix writethrough mode quiescing in cache_mapMike Snitzer
Commit 2ee57d58735 ("dm cache: add passthrough mode") inadvertently removed the deferred set reference that was taken in cache_map()'s writethrough mode support. Restore taking this reference. This issue was found with code inspection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.13+
2014-04-04dm cache: fix a lock-inversionJoe Thornber
When suspending a cache the policy is walked and the individual policy hints written to the metadata via sync_metadata(). This led to this lock order: policy->lock cache_metadata->root_lock When loading the cache target the policy is populated while the metadata lock is held: cache_metadata->root_lock policy->lock Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by ensuring the cache_metadata root_lock is held whilst all the hints are written, rather than being repeatedly locked while policy->lock is held (as was the case with each callout that policy_walk_mappings() made to the old save_hint() method). Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove locking correctness") build option. However, it is not clear how the LOCKDEP reported paths can lead to a deadlock since the two paths, suspending a target and loading a target, never occur at the same time. But that doesn't mean the same lock-inversion couldn't have occurred elsewhere. Reported-by: Marian Csontos <mcsontos@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
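The reported ABBA pattern, sketched with pthread mutexes standing in for the kernel locks (purely illustrative, not the actual sync_metadata() change): the cure is to make both paths agree on a single lock order, here with the metadata root_lock outermost while all hints are written.

  #include <pthread.h>

  static pthread_mutex_t policy_lock  = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t md_root_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Old suspend path: policy_lock then md_root_lock (A then B).
   * Load path:        md_root_lock then policy_lock (B then A).
   * Two threads taking A/B and B/A can deadlock. */

  static void sync_metadata_fixed(void)
  {
      /* Hold the metadata lock once, outermost, for all hint writes,
       * so the order matches the load path (B then A). */
      pthread_mutex_lock(&md_root_lock);
      pthread_mutex_lock(&policy_lock);
      /* ... write all policy hints ... */
      pthread_mutex_unlock(&policy_lock);
      pthread_mutex_unlock(&md_root_lock);
  }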
2014-03-27dm cache: remove remainder of distinct discard block sizeHeinz Mauelshagen
Discard block size not being equal to cache block size causes data corruption by erroneously avoiding migrations in issue_copy() because the discard state is being cleared for a group of cache blocks when it should not. Completely remove all code that enabled a distinction between the cache block size and discard block size. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27dm cache: prevent corruption caused by discard_block_size > cache_block_sizeMike Snitzer
If the discard block size is larger than the cache block size we will not properly quiesce IO to a region that is about to be discarded. This results in a race between a cache migration where no copy is needed, and a write to an adjacent cache block that's within the same large discard block. Work around this by limiting the discard_block_size to cache_block_size. Also limit the max_discard_sectors to cache_block_size. A more comprehensive fix that introduces range locking support in the bio_prison and proper quiescing of a discard range that spans multiple cache blocks is already in development. Reported-by: Morgan Mears <Morgan.Mears@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org
2014-03-12dm cache: fix access beyond end of origin deviceHeinz Mauelshagen
In order to avoid wasting cache space a partial block at the end of the origin device is not cached. Unfortunately, the check for such a partial block at the end of the origin device was flawed. Fix accesses beyond the end of the origin device that occurred due to attempted promotion of an undetected partial block by: - initializing the per bio data struct to allow cache_end_io to work properly - recognizing access to the partial block at the end of the origin device - avoiding out of bounds access to the discard bitset Otherwise, users can experience errors like the following: attempt to access beyond end of device dm-5: rw=0, want=20971520, limit=20971456 ... device-mapper: cache: promotion failed; couldn't copy block Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-03-12dm cache: fix truncation bug when copying a block to/from >2TB fast deviceHeinz Mauelshagen
During demotion or promotion to a cache's >2TB fast device we must not truncate the cache block's associated sector to 32bits. The 32bit temporary result of from_cblock() caused a 32bit multiplication when calculating the sector of the fast device in issue_copy_real(). Use an intermediate 64bit type to store the 32bit from_cblock() to allow for proper 64bit multiplication. Here is an example of how this bug manifests on an ext4 filesystem: EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt. JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
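The truncation pattern behind this fix and the remap fix below, shown as a small self-contained example: a 32-bit block number multiplied in 32-bit arithmetic wraps before it is widened, whereas casting to a 64-bit type first gives the correct sector.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint32_t cblock = 70000000u;   /* cache block number on a >2TB device */
      uint32_t block_sectors = 64;   /* 32KB blocks, 512-byte sectors */

      uint64_t bad  = cblock * block_sectors;            /* wraps in 32 bits */
      uint64_t good = (uint64_t)cblock * block_sectors;  /* widen first */

      printf("bad=%llu good=%llu\n",
             (unsigned long long)bad, (unsigned long long)good);
      return 0;
  }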
2014-02-28dm cache: fix truncation bug when mapping I/O to >2TB fast deviceHeinz Mauelshagen
When remapping a block to the cache's fast device that is larger than 2TB we must not truncate the destination sector to 32bits. The 32bit temporary result of from_cblock() was being overflowed in remap_to_cache() due to the logical left shift. Use an intermediate 64bit type to store the 32bit from_cblock() result to fix the overflow. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-02-17dm cache: do not add migration to completed list before unhooking bioMike Snitzer
When completing an overwrite bio, in overwrite_endio(), the associated migration should not be added to the 'completed_migrations' until the bio's fields are restored with dm_unhook_bio(). Otherwise, do_worker() can race to process 'completed_migrations' before dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is unlikely to cause any problems given the current code but should be fixed on the basis of correctness. Also, the cache's spinlock only needs to be held when manipulating the 'completed_migrations' list -- other changes don't need protection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>