summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2013-03-01dm: add target num_write_bios fnAlasdair G Kergon
Add a num_write_bios function to struct target. If an instance of a target sets this, it will be queried before the target's mapping function is called on a write bio, and the response controls the number of copies of the write bio that the target will receive. This provides a convenient way for a target to send the same data to more than one device. The new cache target uses this in writethrough mode, to send the data both to the cache and the backing device. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm kcopyd: introduce configurable throttlingMikulas Patocka
This patch allows the administrator to reduce the rate at which kcopyd issues I/O. Each module that uses kcopyd acquires a throttle parameter that can be set in /sys/module/*/parameters. We maintain a history of kcopyd usage by each module in the variables io_period and total_period in struct dm_kcopyd_throttle. The actual kcopyd activity is calculated as a percentage of time equal to "(100 * io_period / total_period)". This is compared with the user-defined throttle percentage threshold and if it is exceeded, we sleep. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm ioctl: allow message to return dataMikulas Patocka
This patch introduces enhanced message support that allows the device-mapper core to recognise messages that are common to all devices, and for messages to return data to userspace. Core messages are processed by the function "message_for_md". If the device mapper doesn't support the message, it is passed to the target driver. If the message returns data, the kernel sets the flag DM_MESSAGE_OUT_FLAG. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm ioctl: optimize functions without variable paramsMikulas Patocka
Device-mapper ioctls receive and send data in a buffer supplied by userspace. The buffer has two parts. The first part contains a 'struct dm_ioctl' and has a fixed size. The second part depends on the ioctl and has a variable size. This patch recognises the specific ioctls that do not use the variable part of the buffer and skips allocating memory for it. In particular, when a device is suspended and a resume ioctl is sent, this now avoid memory allocation completely. The variable "struct dm_ioctl tmp" is moved from the function copy_params to its caller ctl_ioctl and renamed to param_kernel. It is used directly when the ioctl function doesn't need any arguments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm ioctl: introduce ioctl_flagsMikulas Patocka
This patch introduces flags for each ioctl function. So far, one flag is defined, IOCTL_FLAGS_NO_PARAMS. It is set if the function processing the ioctl doesn't take or produce any parameters in the section of the data buffer that has a variable size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: merge io_pool and tio_poolJun'ichi Nomura
This patch merges io_pool and tio_pool into io_pool and cleans up related functions. Though device-mapper used to have 2 pools of objects for each dm device, the use of bioset frontbad for per-bio data has shrunk the number of pools to 1 for both bio-based and request-based device types. (See c0820cf5 "dm: introduce per_bio_data" and 94818742 "dm: Use bioset's front_pad for dm_rq_clone_bio_info") So dm no longer has to maintain 2 different pointers. No functional changes. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: remove unused _rq_bio_info_cacheJun'ichi Nomura
Remove _rq_bio_info_cache, which is no longer used. No functional changes. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: fix limits initialization when there are no data devicesMike Christie
dm_calculate_queue_limits will first reset the provided limits to defaults using blk_set_stacking_limits; whereby defeating the purpose of retaining the original live table's limits -- as was intended via commit 3ae706561637331aa578e52bb89ecbba5edcb7a9 ("dm: retain table limits when swapping to new table with no devices"). Fix this improper limits initialization (in the no data devices case) by avoiding the call to dm_calculate_queue_limits. [patch header revised by Mike Snitzer] Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.6+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm snapshot: add missing module aliasesMikulas Patocka
Add module aliases so that autoloading works correctly if the user tries to activate "snapshot-origin" or "snapshot-merge" targets. Reference: https://bugzilla.redhat.com/889973 Reported-by: Chao Yang <chyang@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm persistent data: set some btree fn parms constMike Snitzer
Mark some constant parameters constant in some dm-btree functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: refactor bio cloningAlasdair G Kergon
Refactor part of the bio splitting and cloning code to try to make it easier to understand. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: rename bio cloning functionsAlasdair G Kergon
Rename functions involved in splitting and cloning bios. The sequence of functions is now: (1) __split_and_process* - entry point that selects the processing strategy (2) __send* - prepare the details for each bio needed and loop through them (3) __clone_and_map* - creates a clone and maps it Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: rename request variables to biosAlasdair G Kergon
Use 'bio' in the name of variables and functions that deal with bios rather than 'request' to avoid confusion with the normal block layer use of 'request'. No functional changes. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: clean up clone_bioAlasdair G Kergon
Remove the no-longer-used struct bio_set argument from clone_bio and split_bvec. Use tio->ti in __map_bio() instead of passing in ti. Factor out some code for setting up cloned bios. Take target_request_nr as a parameter to alloc_tio(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm persistent data: remove CONFIG_EXPERIMENTALKees Cook
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: remove CONFIG_EXPERIMENTALAlasdair G Kergon
Remove EXPERIMENTAL from all existing device-mapper targets. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm thin: use block_size_is_power_of_twoMike Snitzer
Use block_size_is_power_of_two() rather than checking sectors_per_block_shift directly. Also introduce local pool variable in get_bio_block() to eliminate redundant tc->pool dereferences. No functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm bufio: use WRITE_FLUSH instead of REQ_FLUSHMikulas Patocka
Use WRITE_FLUSH instead of REQ_FLUSH for submitted requests to make it consistent with the rest of the kernel. There is no functional change. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm table: remove superfluous variable resetWang Sheng-Hui
If allocation fails, the local var *t is not used any more after kfree. Don't need to reset it to NULL. Remove the unnecesary NULL set here. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm thin: support a non power of 2 discard_granularityMike Snitzer
Support a non-power-of-2 discard granularity in dm-thin, now that the block layer supports this(via 8dd2cb7e880d2f77fba53b523c99133ad5054cfd "block: discard granularity might not be power of 2" and 59771079c18c44e39106f0f30054025acafadb41 "blk: avoid divide-by-zero with zero discard granularity"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: fix truncated status stringsMikulas Patocka
Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: do not replace bioset for request based dmJun'ichi Nomura
This patch fixes a regression introduced in v3.8, which causes oops like this when dm-multipath is used: general protection fault: 0000 [#1] SMP RIP: 0010:[<ffffffff810fe754>] [<ffffffff810fe754>] mempool_free+0x24/0xb0 Call Trace: <IRQ> [<ffffffff81187417>] bio_put+0x97/0xc0 [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod] [<ffffffff81185efd>] bio_endio+0x1d/0x30 [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0 [<ffffffff811f2f68>] blk_update_request+0x118/0x520 [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0 [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80 [<ffffffff811f34d0>] blk_end_request+0x10/0x20 [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod] [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod] [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod] [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0 [<ffffffff81044551>] __do_softirq+0xf1/0x250 [<ffffffff8142ee8c>] call_softirq+0x1c/0x30 [<ffffffff8100420d>] do_softirq+0x8d/0xc0 [<ffffffff81044885>] irq_exit+0xd5/0xe0 [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0 [<ffffffff814257af>] common_interrupt+0x6f/0x6f <EOI> [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp] [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod] [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod] [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50 [<ffffffff811f1f69>] blk_delay_work+0x29/0x40 [<ffffffff81059003>] process_one_work+0x1c3/0x5c0 [<ffffffff8105b22e>] worker_thread+0x15e/0x440 [<ffffffff8106164b>] kthread+0xdb/0xe0 [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0 The regression was introduced by the change c0820cf5 "dm: introduce per_bio_data", where dm started to replace bioset during table replacement. For bio-based dm, it is good because clone bios do not exist during the table replacement. For request-based dm, however, (not-yet-mapped) clone bios may stay in request queue and survive during the table replacement. So freeing the old bioset could cause the oops in bio_put(). Since the size of front_pad may change only with bio-based dm, it is not necessary to replace bioset for request-based dm. Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-02-28Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...
2013-02-28hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28dm: convert to idr_alloc()Tejun Heo
Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28dm: don't use idr_remove_all()Tejun Heo
idr_destroy() can destroy idr by itself and idr_remove_all() is being deprecated. Drop its usage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28md: expedite metadata update when switching read-auto -> activeNeilBrown
If something has failed while the array was read-auto, then when we switch to 'active' we need to update the metadata. This will happen anyway but it is good to expedite it, and also to ensure any failed device has been released by the underlying device before we try to action the ioctl which caused us to switch to 'active' mode. Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-27md: remove CONFIG_MULTICORE_RAID456NeilBrown
This doesn't seem to actually help and we have an alternate multi-threading approach waiting in the wings, so just get rid of this config option and associated code. As a bonus, we remove one use of CONFIG_EXPERIMENTAL Cc: Dan Williams <djbw@fb.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26md/raid1,raid10: fix deadlock with freeze_array()NeilBrown
When raid1/raid10 needs to fix a read error, it first drains all pending requests by calling freeze_array(). This calls flush_pending_writes() if it needs to sleep, but some writes may be pending in a per-process plug rather than in the per-array request queue. When raid1{,0}_unplug() moves the request from the per-process plug to the per-array request queue (from which flush_pending_writes() can flush them), it needs to wake up freeze_array(), or freeze_array() will never flush them and so it will block forever. So add the requires wake_up() calls. This bug was introduced by commit f54a9d0e59c4bea3db733921ca9147612a6f292c for raid1 and a similar commit for RAID10, and so has been present since linux-3.6. As the bug causes a deadlock I believe this fix is suitable for -stable. Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y) Reported-by: Tregaron Bayly <tbayly@bluehost.com> Tested-by: Tregaron Bayly <tbayly@bluehost.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md/raid0: improve error message when converting RAID4-with-spares to RAID0NeilBrown
Mentioning "bad disk number -1" exposes irrelevant internal detail. Just say they are inactive and must be removed. Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md: raid0: fix error return from create_stripe_zones.NeilBrown
Create_stripe_zones returns an error slightly differently to raid0_run and to raid0_takeover_*. The error returned used by the second was wrong and an error would result in mddev->private being set to NULL and sooner or later a crash. So never return NULL, return ERR_PTR(err), not NULL from create_stripe_zones. This bug has been present since 2.6.35 so the fix is suitable for any kernel since then. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md: fix two bugs when attempting to resize RAID0 array.NeilBrown
You cannot resize a RAID0 array (in terms of making the devices bigger), but the code doesn't entirely stop you. So: disable setting of the available size on each device for RAID0 and Linear devices. This must not change as doing so can change the effective layout of data. Make sure that the size that raid0_size() reports is accurate, but rounding devices sizes to chunk sizes. As the device sizes cannot change now, this isn't so important, but it is best to be safe. Without this change: mdadm --grow /dev/md0 -z max mdadm --grow /dev/md0 -Z max then read to the end of the array can cause a BUG in a RAID0 array. These bugs have been present ever since it became possible to resize any device, which is a long time. So the fix is suitable for any -stable kerenl. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26DM RAID: Add support for MD's RAID10 "far" and "offset" algorithmsJonathan Brassow
DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms Until now, dm-raid.c only supported the "near" algorthm of MD's RAID10 implementation. This patch adds support for the "far" and "offset" algorithms, but only with the improved redundancy that is brought with the introduction of the 'use_far_sets' bit, which shifts copied stripes according to smaller sets vs the entire array. That is, the 17th bit of the 'layout' variable that defines the RAID10 implementation will always be set. (More information on how the 'layout' variable selects the RAID10 algorithm can be found in the opening comments of drivers/md/raid10.c.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)Jonathan Brassow
MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) This patch addresses raid arrays that have a number of devices that cannot be evenly divided by 'far_copies'. (E.g. 5 devices, far_copies = 2) This case must be handled differently because it causes that last set to be of a different size than the rest of the sets. We must compute a new modulo for this last set so that copied chunks are properly wrapped around. Example use_far_sets=1, far_copies=2, near_copies=1, devices=5: "far" algorithm dev1 dev2 dev3 dev4 dev5 ==== ==== ==== ==== ==== [ A B ] [ C D E ] [ G H ] [ I J K ] ... [ B A ] [ E C D ] --> nominal set of 2 and last set of 3 [ H G ] [ K I J ] []'s show far/offset sets Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)Jonathan Brassow
The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe widths - copying them to a different location on the same devices after shifting the stripe. An example layout of each follows below: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... F A B C D E --> Copy of stripe0, but shifted by 1 L G H I J K ... "offset" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F F A B C D E --> Copy of stripe0, but shifted by 1 G H I J K L L G H I J K ... Redundancy for these algorithms is gained by shifting the copied stripes one device to the right. This patch proposes that array be divided into sets of adjacent devices and when the stripe copies are shifted, they wrap on set boundaries rather than the array size boundary. That is, for the purposes of shifting, the copies are confined to their sets within the array. The sets are 'near_copies * far_copies' in size. The above "far" algorithm example would change to: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets H G J I L K Dev sets are 1-2, 3-4, 5-6 ... This has the affect of improving the redundancy of the array. We can always sustain at least one failure, but sometimes more than one can be handled. In the first examples, the pairs of devices that CANNOT fail together are: (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs] In the example where the copies are confined to sets, the pairs of devices that cannot fail together are: (1,2) (3,4) (5,6) [20% of possible pairs] We cannot simply replace the old algorithms, so the 17th bit of the 'layout' variable is used to indicate whether we use the old or new method of computing the shift. (This is similar to the way the 16th bit indicates whether the "far" algorithm or the "offset" algorithm is being used.) This patch only handles the cases where the number of total raid disks is a multiple of 'far_copies'. A follow-on patch addresses the condition where this is not true. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26MD RAID10: Minor non-functional code changesJonathan Brassow
Changes include assigning 'addr' from 's' instead of 'sector' to be consistent with the way the code does it just a few lines later and using '%=' vs a conditional and subtraction. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-26md: raid1,10: Handle REQ_WRITE_SAME flag in write biosJoe Lawrence
Set mddev queue's max_write_same_sectors to its chunk_sector value (before disk_stack_limits merges the underlying disk limits.) With that in place, be sure to handle writes coming down from the block layer that have the REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned write bio. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-24drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZEAndrew Morton
Fix the warning: drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined In file included from include/linux/elevator.h:5, from include/linux/blkdev.h:216, from drivers/md/persistent-data/dm-block-manager.h:11, from drivers/md/persistent-data/dm-transaction-manager.h:10, from drivers/md/persistent-data/dm-transaction-manager.c:6: include/linux/hashtable.h:22:1: warning: this is the location of the previous definition Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-21md: protect against crash upon fsync on ro arraySebastian Riemer
If an fsync occurs on a read-only array, we need to send a completion for the IO and may not increment the active IO count. Otherwise, we hit a bug trace and can't stop the MD array anymore. By advice of Christoph Hellwig we return success upon a flush request but we return -EROFS for other writes. We detect flush requests by checking if the bio has zero sectors. This patch is suitable to any -stable kernel to which it applies. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: NeilBrown <neilb@suse.de>
2013-02-01Merge tag 'dm-3.8-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull more device-mapper fixes from Alasdair G Kergon: "A fix for stacked dm thin devices and a fix for the new dm WRITE SAME support." * tag 'dm-3.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm: fix write same requests counting dm thin: fix queue limits stacking
2013-01-31dm: fix write same requests countingAlasdair G Kergon
When processing write same requests, fix dm to send the configured number of WRITE SAME requests to the target rather than the number of discards, which is not always the same. Device-mapper WRITE SAME support was introduced by commit 23508a96cd2e857d57044a2ed7d305f2d9daf441 ("dm: add WRITE SAME support"). Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com>
2013-01-31dm thin: fix queue limits stackingMike Snitzer
thin_io_hints() is blindly copying the queue limits from the thin-pool which can lead to incorrect limits being set. The fix here simply deletes the thin_io_hints() hook which leaves the existing stacking infrastructure to set the limits correctly. When a thin-pool uses an MD device for the data device a thin device from the thin-pool must respect MD's constraints about disallowing a bio from spanning multiple chunks. Otherwise we can see problems. If the raid0 chunksize is 1152K and thin-pool chunksize is 256K I see the following md/raid0 error (with extra debug tracing added to thin_endio) when mkfs.xfs is executed against the thin device: md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127 device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0 This extra DM debugging shows that the failing bio is spanning across the first and second logical 1152K chunk (sector 2080 + 255 takes the bio beyond the first chunk's boundary of sector 2304). So the bio splitting that DM is doing clearly isn't respecting the MD limits. max_hw_sectors_kb is 127 for both the thin-pool and thin device (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of precision). So this explains why bi_size is 130560. But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given that it doesn't have a .merge function (for bio_add_page to consult indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD device that has a compulsory merge_bvec_fn. This scenario is exactly why DM must resort to sending single PAGE_SIZE bios to the underlying layer. Some additional context for this is available in the header for commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries"). Long story short, the reason a thin device doesn't properly get configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that thin_io_hints() is blindly copying the queue limits from the thin-pool device directly to the thin device's queue limits. Fix this by eliminating thin_io_hints. Doing so is safe because the block layer's queue limits stacking already enables the upper level thin device to inherit the thin-pool device's discard and minimum_io_size and optimal_io_size limits that get set in pool_io_hints. But avoiding the queue limits copy allows the thin and thin-pool limits to be different where it is important, namely max_hw_sectors_kb. Reported-by: Daniel Browning <db@kavod.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-01-24DM-RAID: Fix RAID10's check for sufficient redundancyJonathan Brassow
Before attempting to activate a RAID array, it is checked for sufficient redundancy. That is, we make sure that there are not too many failed devices - or devices specified for rebuild - to undermine our ability to activate the array. The current code performs this check twice - once to ensure there were not too many devices specified for rebuild by the user ('validate_rebuild_devices') and again after possibly experiencing a failure to read the superblock ('analyse_superblocks'). Neither of these checks are sufficient. The first check is done properly but with insufficient information about the possible failure state of the devices to make a good determination if the array can be activated. The second check is simply done wrong in the case of RAID10 because it doesn't account for the independence of the stripes (i.e. mirror sets). The solution is to use the properly written check ('validate_rebuild_devices'), but perform the check after the superblocks have been read and we know which devices have failed. This gives us one check instead of two and performs it in a location where it can be done right. Only RAID10 was affected and it was affected in the following ways: - the code did not properly catch the condition where a user specified a device for rebuild that already had a failed device in the same mirror set. (This condition would, however, be caught at a deeper level in MD.) - the code triggers a false positive and denies activation when devices in independent mirror sets have failed - counting the failures as though they were all in the same set. The most likely place this error was introduced (or this patch should have been included) is in commit 4ec1e369 - first introduced in v3.7-rc1. Consequently this fix should also go in v3.7.y, however there is a small conflict on the .version in raid_target, so I'll submit a separate patch to -stable. Cc: stable@vger.kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-01-14block: add missing block_bio_complete() tracepointTejun Heo
bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-22Merge tag 'dm-3.8-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull dm update from Alasdair G Kergon: "Miscellaneous device-mapper fixes, cleanups and performance improvements. Of particular note: - Disable broken WRITE SAME support in all targets except linear and striped. Use it when kcopyd is zeroing blocks. - Remove several mempools from targets by moving the data into the bio's new front_pad area(which dm calls 'per_bio_data'). - Fix a race in thin provisioning if discards are misused. - Prevent userspace from interfering with the ioctl parameters and use kmalloc for the data buffer if it's small instead of vmalloc. - Throttle some annoying error messages when I/O fails." * tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits) dm stripe: add WRITE SAME support dm: remove map_info dm snapshot: do not use map_context dm thin: dont use map_context dm raid1: dont use map_context dm flakey: dont use map_context dm raid1: rename read_record to bio_record dm: move target request nr to dm_target_io dm snapshot: use per_bio_data dm verity: use per_bio_data dm raid1: use per_bio_data dm: introduce per_bio_data dm kcopyd: add WRITE SAME support to dm_kcopyd_zero dm linear: add WRITE SAME support dm: add WRITE SAME support dm: prepare to support WRITE SAME dm ioctl: use kmalloc if possible dm ioctl: remove PF_MEMALLOC dm persistent data: improve improve space map block alloc failure message dm thin: use DMERR_LIMIT for errors ...
2012-12-21dm stripe: add WRITE SAME supportMike Snitzer
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE SAME bio processing. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: remove map_infoMikulas Patocka
This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm snapshot: do not use map_contextMikulas Patocka
Eliminate struct map_info from dm-snap. map_info->ptr was used in dm-snap to indicate if the bio was tracked. If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash. This patch removes the use of map_info->ptr. We determine if the bio was tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true, the bio is not tracked, if it is false, the bio is tracked. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>