Age | Commit message (Collapse) | Author |
|
|
|
commit 1acacc0784aab45627b6009e0e9224886279ac0b upstream.
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr(). Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:
device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 4d1662a30dde6e545086fe0e8fd7e474c4e0b639 upstream.
Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit. This commit occurs regardless of whether
any thin devices have made changes.
Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().
Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 8b64e881eb40ac8b9bfcbce068a97eef819044ee upstream.
The pool mode must not be switched until after the corresponding pool
process_* methods have been established. Otherwise, because
set_pool_mode() isn't interlocked with the IO path for performance
reasons, the IO path can end up executing process_* operations that
don't match the mode. This patch eliminates problems like the following
(as seen on really fast PCIe SSD storage when transitioning the pool's
mode from PM_READ_ONLY to PM_WRITE):
kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
kernel: device-mapper: thin: 253:2: no free data space available.
kernel: device-mapper: thin: 253:2: switching pool to read-only mode
kernel: device-mapper: thin: 253:2: switching pool to write mode
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
...
kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
kernel: Call Trace:
kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: ---[ end trace 3f00528e08ffa55c ]---
kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
at the top of handle_unserviceable_bio(). And as the additional
debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
expected, it was already PM_WRITE, yet pool->process_bio was still set
to process_bio_read_only().
Also, while fixing this up, reduce logging of redundant pool mode
transitions by checking new_mode is different from old_mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 16961b042db8cc5cf75d782b4255193ad56e1d4f upstream.
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 19fa1a6756ed9e92daa9537c03b47d6b55cc2316 upstream.
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9b7aaa64f96f7ca280d75326fca42f42017b89ef upstream.
A thin-pool may be in read-only mode because the pool's data or metadata
space was exhausted. To allow for recovery, by adding more space to the
pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
mode. Otherwise, running out of space will render the pool permanently
read-only.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5383ef3a929a1366e2ced45cd6d74be7aa2a2281 upstream.
If the thin-pool transitioned to fail mode and the thin-pool's table
were reloaded for some reason: the new table's default pool mode would
be read-write, though it will transition to fail mode during resume.
When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
to re-establish the intermediate read-only state in both the metadata
and persistent-data block manager (as is usually done with the normal
pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 020cc3b5e28c2e24f59f53a9154faf08564f308e upstream.
Rename commit_or_fallback() to commit(). Now all previous calls to
commit() will trigger the pool mode to fallback if the commit fails.
Also, check the error returned from commit() in alloc_data_block().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4a02b34e0cf1d0d0dd3737702841da4bf615a50a upstream.
Switch the thin pool to read-only mode in alloc_data_block() if
dm_pool_alloc_data_block() fails because the pool's metadata space is
exhausted.
Differentiate between data and metadata space in messages about no
free space available.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
<snip ... these repeat for a _very_ long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: commit failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fafc7a815e40255d24e80a1cb7365892362fa398 upstream.
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Fix issue where the block layer would stack the discard limits of the
pool's data device even if the "ignore_discard" pool feature was
specified.
The pool and thin device(s) still had discards disabled because the
QUEUE_FLAG_DISCARD request_queue flag wasn't set. But to avoid user
confusion when "ignore_discard" is used: both the pool device and the
thin device(s) have zeroes for all discard limits.
Also, always set discard_zeroes_data_unsupported in targets because they
should never advertise the 'discard_zeroes_data' capability (even if the
pool's data device supports it).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
If pool has 'no_free_space' set it means a previous allocation already
determined the pool has no free space (and failed that allocation with
-ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do
not allow the pool to oscillate between allocating blocks and then not.
But a side-effect of this determinism is that if a user wants to be able
to allocate new blocks they'll need to reload the pool's table (to clear
the 'no_free_space' flag). This reload will happen automatically if the
pool's data volume is resized. But if the user takes action to free a
lot of space by deleting snapshot volumes, etc the pool will no longer
allow data allocations to continue without an intervening table reload.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
break_sharing() now handles an arbitrary alloc_data_block() error
the same way as provision_block(): marks pool read-only and errors the
cell.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Useful to know which pool is experiencing the error.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Do not blindly override the queue limits (specifically io_min and
io_opt). Allow traditional stacking of these limits if io_opt is a
factor of the thin-pool's data block size.
Without this patch mkfs.xfs does not recognize the thin device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored. This was due to setting io_min to a useless value.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Fix detection of the need to resize the dm thin metadata device.
The code incorrectly tried to extend the metadata device when it
didn't need to due to a merging error with patch 24347e9 ("dm thin:
detect metadata device resizing").
device-mapper: transaction manager: couldn't open metadata space map
device-mapper: thin metadata: tm_open_with_sm failed
device-mapper: thin: aborting transaction failed
device-mapper: thin: switching pool to failure mode
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Generate a dm event when the amount of remaining thin pool metadata
space falls below a certain level.
The threshold is taken to be a quarter of the size of the metadata
device with a minimum threshold of 4MB.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Allow the dm thin pool metadata device to be extended.
Whenever a pool is resumed, detect whether the size of the metadata
device has increased, and if so, extend the metadata to use the new
space.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
If a thin pool is created in read-only-metadata mode then only open the
metadata device read-only.
Previously it was always opened with FMODE_READ | FMODE_WRITE.
(Note that dm_get_device() still allows read-only dm devices to be used
read-write at the moment: If I create a read-only linear device for the
metadata, via dmsetup load --readonly, then I can still create a rw pool
out of it.)
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Refactor device size functions in preparation for similar metadata
device resizing functions.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Fix a discard granularity calculation to work for non power of 2 block sizes.
In order for thinp to passdown discard bios to the underlying data
device, the data device must have a discard granularity that is a
factor of the thinp block size. Originally this check was done by
using bitops since the block_size was known to be a power of two.
Introduced by commit f13945d75730081830b6f3360266950e2b7c9067
("dm thin: support a non power of 2 discard_granularity").
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Fix a bug in dm_btree_remove that could leave leaf values with incorrect
reference counts. The effect of this was that removal of a shared block
could result in the space maps thinking the block was no longer used.
More concretely, if you have a thin device and a snapshot of it, sending
a discard to a shared region of the thin could corrupt the snapshot.
Thinp uses a 2-level nested btree to store it's mappings. This first
level is indexed by thin device, and the second level by logical
block.
Often when we're removing an entry in this mapping tree we need to
rebalance nodes, which can involve shadowing them, possibly creating a
copy if the block is shared. If we do create a copy then children of
that node need to have their reference counts incremented. In this
way reference counts percolate down the tree as shared trees diverge.
The rebalance functions were incrementing the children at the
appropriate time, but they were always assuming the children were
internal nodes. This meant the leaf values (in our case packed
block/flags entries) were not being incremented.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
This patch takes advantage of the new bio-prison interface where the
memory is now passed in rather than using a mempool in bio-prison.
This allows the map function to avoid performing potentially-blocking
allocations that could lead to deadlocks: We want to avoid the cell
allocation that is done in bio_detain.
(The potential for mempool deadlocks still remains in other functions
that use bio_detain.)
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Change the dm_bio_prison interface so that instead of allocating memory
internally, dm_bio_detain is supplied with a pre-allocated cell each
time it is called.
This enables a subsequent patch to move the allocation of the struct
dm_bio_prison_cell outside the thin target's mapping function so it can
no longer block there.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
This patch allows the administrator to reduce the rate at which kcopyd
issues I/O.
Each module that uses kcopyd acquires a throttle parameter that can be
set in /sys/module/*/parameters.
We maintain a history of kcopyd usage by each module in the variables
io_period and total_period in struct dm_kcopyd_throttle. The actual
kcopyd activity is calculated as a percentage of time equal to
"(100 * io_period / total_period)". This is compared with the user-defined
throttle percentage threshold and if it is exceeded, we sleep.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Use 'bio' in the name of variables and functions that deal with
bios rather than 'request' to avoid confusion with the normal
block layer use of 'request'.
No functional changes.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Use block_size_is_power_of_two() rather than checking
sectors_per_block_shift directly. Also introduce local pool variable in
get_bio_block() to eliminate redundant tc->pool dereferences.
No functional change.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Support a non-power-of-2 discard granularity in dm-thin, now that the block
layer supports this(via 8dd2cb7e880d2f77fba53b523c99133ad5054cfd "block:
discard granularity might not be power of 2" and
59771079c18c44e39106f0f30054025acafadb41 "blk: avoid divide-by-zero with zero
discard granularity").
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Avoid returning a truncated table or status string instead of setting
the DM_BUFFER_FULL_FLAG when the last target of a table fills the
buffer.
When processing a table or status request, the function retrieve_status
calls ti->type->status. If ti->type->status returns non-zero,
retrieve_status assumes that the buffer overflowed and sets
DM_BUFFER_FULL_FLAG.
However, targets don't return non-zero values from their status method
on overflow. Most targets returns always zero.
If a buffer overflow happens in a target that is not the last in the
table, it gets noticed during the next iteration of the loop in
retrieve_status; but if a buffer overflow happens in the last target, it
goes unnoticed and erroneously truncated data is returned.
In the current code, the targets behave in the following way:
* dm-crypt returns -ENOMEM if there is not enough space to store the
key, but it returns 0 on all other overflows.
* dm-thin returns errors from the status method if a disk error happened.
This is incorrect because retrieve_status doesn't check the error
code, it assumes that all non-zero values mean buffer overflow.
* all the other targets always return 0.
This patch changes the ti->type->status function to return void (because
most targets don't use the return code). Overflow is detected in
retrieve_status: if the status method fills up the remaining space
completely, it is assumed that buffer overflow happened.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
thin_io_hints() is blindly copying the queue limits from the thin-pool
which can lead to incorrect limits being set. The fix here simply
deletes the thin_io_hints() hook which leaves the existing stacking
infrastructure to set the limits correctly.
When a thin-pool uses an MD device for the data device a thin device
from the thin-pool must respect MD's constraints about disallowing a bio
from spanning multiple chunks. Otherwise we can see problems. If the raid0
chunksize is 1152K and thin-pool chunksize is 256K I see the following
md/raid0 error (with extra debug tracing added to thin_endio) when
mkfs.xfs is executed against the thin device:
md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0
This extra DM debugging shows that the failing bio is spanning across
the first and second logical 1152K chunk (sector 2080 + 255 takes the
bio beyond the first chunk's boundary of sector 2304). So the bio
splitting that DM is doing clearly isn't respecting the MD limits.
max_hw_sectors_kb is 127 for both the thin-pool and thin device
(queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
precision). So this explains why bi_size is 130560.
But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
that it doesn't have a .merge function (for bio_add_page to consult
indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
device that has a compulsory merge_bvec_fn. This scenario is exactly
why DM must resort to sending single PAGE_SIZE bios to the underlying
layer. Some additional context for this is available in the header for
commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries").
Long story short, the reason a thin device doesn't properly get
configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
thin_io_hints() is blindly copying the queue limits from the thin-pool
device directly to the thin device's queue limits.
Fix this by eliminating thin_io_hints. Doing so is safe because the
block layer's queue limits stacking already enables the upper level thin
device to inherit the thin-pool device's discard and minimum_io_size and
optimal_io_size limits that get set in pool_io_hints. But avoiding the
queue limits copy allows the thin and thin-pool limits to be different
where it is important, namely max_hw_sectors_kb.
Reported-by: Daniel Browning <db@kavod.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
This patch removes map_info from bio-based device mapper targets.
map_info is still used for request-based targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
This patch removes any use of map_info in preparation for the next patch
that removes map_info from bio-based device mapper.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.
WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.
The dm thin target uses this.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Throttle all errors logged from the IO path by dm thin.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device. The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
When discards are prepared it is best to directly wake the worker that
will process them. The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.
Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding. DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.
The race manifests as follows:
1. A non-discard bio is mapped in thin_bio_map.
This doesn't lock out parallel activity to the same block.
2. A discard bio is issued to the same block as the non-discard bio.
3. The discard bio is locked in a dm_bio_prison_cell in process_discard
to lock out parallel activity against the same block.
4. The non-discard bio's mapping continues and its all_io_entry is
incremented so the bio is accounted for in the thin pool's all_io_ds
which is a dm_deferred_set used to track time locality of non-discard IO.
5. The non-discard bio is finally locked in a dm_bio_prison_cell in
process_bio.
The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:
INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby D ffffffff8160f0e0 0 15354 15314 0x00000000
ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
[<ffffffff814e5a19>] schedule+0x29/0x70
[<ffffffff814e3d85>] schedule_timeout+0x195/0x220
[<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
[<ffffffff814e589e>] wait_for_common+0x11e/0x190
[<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
[<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
[<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
[<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
[<ffffffff8119a65c>] block_ioctl+0x3c/0x40
[<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
[<ffffffff8119a547>] ? block_llseek+0x67/0xb0
[<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
[<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
[<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.
The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks. That cell locking wasn't
occurring early enough in thin_bio_map. This patch fixes this.
Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.
Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so. Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.
This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
|
|
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.
Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.
If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton. Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.
Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.
This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The bio prison code will be useful to other future DM targets so
move it to a separate module.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The bio prison code will be useful to share with future DM targets.
Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.
This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The discard limits that get established for a thin-pool or thin device
may be incompatible with the pool's data device. Avoid this by checking
the discard limits of the pool's data device. If an incompatibility is
found then the pool's 'discard passdown' feature is disabled.
Change thin_io_hints to ensure that a thin device always uses the same
queue limits as its pool device.
Introduce requested_pf to track whether or not the table line originally
contained the no_discard_passdown flag and use this directly for table
output. We prepare the correct setting for discard_passdown directly in
bind_control_target (called from pool_io_hints) and store it in
adjusted_pf rather than waiting until we have access to pool->pf in
pool_preresume.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
A little thin discard code refactoring to make the next patch (dm thin:
fix discard support for data devices) more readable.
Pull out a couple of functions (and uses bools instead of unsigned for
features).
No functional changes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
The dm thin pool target claims to support the zeroing of discarded
data areas. This turns out to be incorrect when processing discards
that do not exactly cover a complete number of blocks, so the target
must always set discard_zeroes_data_unsupported.
The thin pool target will zero blocks when they are allocated if the
skip_block_zeroing feature is not specified. The block layer
may send a discard that only partly covers a block. If a thin pool
block is partially discarded then there is no guarantee that the
discarded data will get zeroed before it is accessed again.
Due to this, thin devices cannot claim discards will always zero data.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Commit outstanding metadata before returning the status for a dm thin
pool so that the numbers reported are as up-to-date as possible.
The commit is not performed if the device is suspended or if
the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
through a new 'status_flags' parameter in the target's dm_status_fn.
The userspace dmsetup tool will support the --noflush flag with the
'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
onwards.
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Add read-only and fail-io modes to thin provisioning.
If a transaction commit fails the pool's metadata device will transition
to "read-only" mode. If a commit fails once already in read-only mode
the transition to "fail-io" mode occurs.
Once in fail-io mode the pool and all associated thin devices will
report a status of "Fail".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|
|
Reduce the number of metadata commits by using
dm_thin_changed_this_transaction to check if metadata was changed on a
per thin device granularity.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
|