summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2011-12-13block, cfq: move icq cache management to block coreTejun Heo
Let elevators set ->icq_size and ->icq_align in elevator_type and elv_register() and elv_unregister() respectively create and destroy kmem_cache for icq. * elv_register() now can return failure. All callers updated. * icq caches are automatically named "ELVNAME_io_cq". * cfq_slab_setup/kill() are collapsed into cfq_init/exit(). * While at it, minor indentation change for iosched_cfq.elevator_name for consistency. This will help moving icq management to block core. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: move io_cq lookup to blk-ioc.cTejun Heo
Now that all io_cq related data structures are in block core layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c. Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with parameter return type changes (cfqd -> request_queue, cfq_io_cq -> io_cq) and cfq_cic_lookup() becomes thin wrapper around cfq_cic_lookup(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: move cfqd->icq_list to request_queue and add request->elv.icqTejun Heo
Most of icq management is about to be moved out of cfq into blk-ioc. This patch prepares for it. * Move cfqd->icq_list to request_queue->icq_list * Make request explicitly point to icq instead of through elevator private data. ->elevator_private[3] is replaced with sub struct elv which contains icq pointer and priv[2]. cfq is updated accordingly. * Meaningless clearing of ->elevator_private[0] removed from elv_set_request(). At that point in code, the field was guaranteed to be %NULL anyway. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: reorganize cfq_io_context into generic and cfq specific partsTejun Heo
Currently io_context and cfq logics are mixed without clear boundary. Most of io_context is independent from cfq but cfq_io_context handling logic is dispersed between generic ioc code and cfq. cfq_io_context represents association between an io_context and a request_queue, which is a concept useful outside of cfq, but it also contains fields which are useful only to cfq. This patch takes out generic part and put it into io_cq (io context-queue) and the rest into cfq_io_cq (cic moniker remains the same) which contains io_cq. The following changes are made together. * cfq_ttime and cfq_io_cq now live in cfq-iosched.c. * All related fields, functions and constants are renamed accordingly. * ioc->ioc_data is now "struct io_cq *" instead of "void *" and renamed to icq_hint. This prepares for io_context API cleanup. Documentation is currently sparse. It will be added later. Changes in this patch are mechanical and don't cause functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: remove elevator_queue->opsTejun Heo
elevator_queue->ops points to the same ops struct ->elevator_type.ops is pointing to. The only effect of caching it in elevator_queue is shorter notation - it doesn't save any indirect derefence. Relocate elevator_type->list which used only during module init/exit to the end of the structure, rename elevator_queue->elevator_type to ->type, and replace elevator_queue->ops with elevator_queue->type.ops. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: reorder elevator switch sequenceTejun Heo
Elevator switch sequence first attached the new elevator, then tried registering it (sysfs) and if that failed attached back the old elevator. However, sysfs registration doesn't require the elevator to be attached, so there is no reason to do the "detach, attach new, register, maybe re-attach old" sequence. It can just do "register, detach, attach". * elevator_init_queue() is updated to set ->elevator_data directly and return 0 / -errno. This allows elevator_exit() on an unattached elevator. * __elv_unregister_queue() which was necessary to unregister unattached q is removed in favor of __elv_register_queue() which can register unattached q. * elevator_attach() becomes a single assignment and obscures more then it helps. Dropped. This will help cleaning up io_context handling across elevator switch. This patch doesn't introduce visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: replace current_io_context() with create_io_context()Tejun Heo
When called under queue_lock, current_io_context() triggers lockdep warning if it hits allocation path. This is because io_context installation is protected by task_lock which is not IRQ safe, so it triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock deadlock warning. Given the restriction, accessor + creator rolled into one doesn't work too well. Drop current_io_context() and let the users access task->io_context directly inside queue_lock combined with explicit creation using create_io_context(). Future ioc updates will further consolidate ioc access and the create interface will be unexported. While at it, relocate ioc internal interface declarations in blk.h and add section comments before and after. This patch does not introduce functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: kill cic->keyTejun Heo
Now that lazy paths are removed, cfqd_dead_key() is meaningless and cic->q can be used whereever cic->key is used. Kill cic->key. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: kill ioc_goneTejun Heo
Now that cic's are immediately unlinked under both locks, there's no need to count and drain cic's before module unload. RCU callback completion is waited with rcu_barrier(). While at it, remove residual RCU operations on cic_list. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: remove delayed unlinkTejun Heo
Now that all cic's are immediately unlinked from both ioc and queue, lazy dropping from lookup path and trimming on elevator unregister are unnecessary. Kill them and remove now unused elevator_ops->trim(). This also leaves call_for_each_cic() without any user. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: unlink cfq_io_context's immediatelyTejun Heo
cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's *far* simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: fix cic lookup lockingTejun Heo
* cfq_cic_lookup() may be called without queue_lock and multiple tasks can execute it simultaneously for the same shared ioc. Nothing prevents them racing each other and trying to drop the same dead cic entry multiple times. * smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing prevents cfq_cic_lookup() seeing stale cic->key. This usually doesn't blow up because by the time cic is exited, all requests have been drained and new requests are terminated before going through elevator. However, it can still be triggered by plug merge path which doesn't grab queue_lock and thus can't check DEAD state reliably. This patch updates lookup locking such that, * Lookup is always performed under queue_lock. This doesn't add any more locking. The only issue is cfq_allow_merge() which can be called from plug merge path without holding any lock. For now, this is worked around by using cic of the request to merge into, which is guaranteed to have the same ioc. For longer term, I think it would be best to separate out plug merge method from regular one. * Spurious ioc->lock locking around cic lookup hint assignment dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: fix race condition in cic creation path and tighten lockingTejun Heo
cfq_get_io_context() would fail if multiple tasks race to insert cic's for the same association. This patch restructures cfq_get_io_context() such that slow path insertion race is handled properly. Note that the restructuring also makes cfq_get_io_context() called under queue_lock and performs both ioc and cfqd insertions while holding both ioc and queue locks. This is part of on-going locking tightening and will be used to simplify synchronization rules. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: move ioc ioprio/cgroup changed handling to cicTejun Heo
ioprio/cgroup change was handled by marking the changed state in ioc and, on the following access to the ioc, performing RCU-protected iteration through all cic's grabbing the matching queue_lock. This patch moves the changed state to each cic. When ioprio or cgroup changes, the respective bit is set on all cic's of the ioc and when each of those cic (not ioc) is accessed, change is applied for that specific ioc-queue pair. This also fixes the following two race conditions between setting and clearing of changed states. * Missing barrier between assign/load of ioprio and ioprio_changed allowed applying old ioprio. * Change requests could happen between application of change and clearing of changed variables. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: misc updates to cfq_io_contextTejun Heo
Make the following changes to prepare for ioc/cic management cleanup. * Add cic->q so that ioc can determine the associated queue without querying cfq. This will eventually replace ->key. * Factor out cfq_release_cic() from cic_free_func(). This function assumes that the caller handled locking. * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it take only @cic. * Restructure cfq_cic_link() for future updates. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: misc updates to blk_get_queue()Tejun Heo
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on failure instead of 0 / -errno or boolean. Update it such that it returns %true on success and %false on failure. * Make sure the caller checks for the return value. * Separate out __blk_get_queue() which doesn't check whether @q is dead and put it in blk.h. This will be used later. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: make ioc get/put interface more conventional and fix race on alloctionTejun Heo
Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchornization between them are peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: misc ioc cleanupsTejun Heo
* int return from put_io_context() wasn't used by anybody. Make it return void like other put functions and docbook-fy the function comment. * Reorder dummy declarations for !CONFIG_BLOCK case a bit. * Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of if block and drop 0'ing. * Docbook-fy current_io_context() comment. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, cfq: move cfqd->cic_index to q->idTejun Heo
cfq allocates per-queue id using ida and uses it to index cic radix tree from io_context. Move it to q->id and allocate on queue init and free on queue release. This simplifies cfq a bit and will allow for further improvements of io context life-cycle management. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: add missing blk_queue_dead() checksTejun Heo
blk_insert_cloned_request(), blk_execute_rq_nowait() and blk_flush_plug_list() either didn't check whether the queue was dead or did it without holding queue_lock. Update them so that dead state is checked while holding queue_lock. AFAICS, this plugs all holes (requeue doesn't matter as the request is transitioning atomically from in_flight to queued). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: fix drain_all condition in blk_drain_queue()Tejun Heo
When trying to drain all requests, blk_drain_queue() checked only q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This patch updates blk_drain_queue() such that it looks at all the counters and queues so that request_queue is actually empty on completion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block: add blk_queue_dead()Tejun Heo
There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead() macro and use it. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13block, sx8: kill blk_insert_request()Tejun Heo
The only user left for blk_insert_request() is sx8 and it can be trivially switched to use blk_execute_rq_nowait() - special requests aren't included in io stat and sx8 doesn't use block layer tagging. Switch sx8 and kill blk_insert_requeset(). This patch doesn't introduce any functional difference. Only compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-13cgroup: don't use subsys->can_attach_task() or ->attach_task()Tejun Heo
Now that subsys->can_attach() and attach() take @tset instead of @task, they can handle per-task operations. Convert ->can_attach_task() and ->attach_task() users to use ->can_attach() and attach() instead. Most converions are straight-forward. Noteworthy changes are, * In cgroup_freezer, remove unnecessary NULL assignments to unused methods. It's useless and very prone to get out of sync, which already happened. * In cpuset, PF_THREAD_BOUND test is checked for each task. This doesn't make any practical difference but is conceptually cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>
2011-12-02cfq-iosched: fix cfq_cic_link() race confitionYasuaki Ishimatsu
cfq_cic_link() has race condition. When some processes which shared ioc issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST sometimes. The race condition might stop I/O by following steps: step 1: Process A: Issue an I/O to /dev/sda step 2: Process A: Get an ioc (iocA here) in get_io_context() which does not linked with a cic for the device step 3: Process A: Get a new cic for the device (cicA here) in cfq_alloc_io_context() step 4: Process B: Issue an I/O to /dev/sda step 5: Process B: Get iocA in get_io_context() since process A and B share the same ioc step 6: Process B: Get a new cic for the device (cicB here) in cfq_alloc_io_context() since iocA has not been linked with a cic for the device yet step 7: Process A: Link cicA to iocA in cfq_cic_link() step 8: Process A: Dispatch I/O to driver and finish it step 9: Process B: Try to link cicB to iocA in cfq_cic_link() But it fails with showing "cfq: cic link failed!" kernel message, since iocA has already linked with cicA at step 7. step 10: Process B: Wait for finishig I/O in get_request_wait() The function does not wake up, when there is no I/O to the device. When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic. So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup(). Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-30cfq-iosched: free cic_index if blkio_alloc_blkg_stats failsmajianpeng
If we fail allocating the blkpg stats, we free cfqd and cfgq. But we need to free the IDA cfqd->cic_index as well. Signed-off-by: majianpeng <majianpeng@gmail.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-23block: initialize request_queue's numa node duringMike Snitzer
struct request_queue is allocated with __GFP_ZERO so its "node" field is zero before initialization. This causes an oops if node 0 is offline in the page allocator because its zonelists are not initialized. From Dave Young's dmesg: SRAT: Node 1 PXM 2 0-d0000000 SRAT: Node 1 PXM 2 100000000-330000000 SRAT: Node 0 PXM 1 330000000-630000000 Initmem setup node 1 0000000000000000-000000000affb000 ... Built 1 zonelists in Node order, mobility grouping on. ... BUG: unable to handle kernel paging request at 0000000000001c08 IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870 and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on zonelist->_zonerefs. The fix is to initialize q->node at the time of allocation so the correct node is passed to the slab allocator later. Since blk_init_allocated_queue_node() is no longer needed, merge it with blk_init_allocated_queue(). [rientjes@google.com: changelog, initializing q->node] Cc: stable@vger.kernel.org [2.6.37+] Reported-by: Dave Young <dyoung@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Dave Young <dyoung@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16block: add missed trace_block_plugShaohua Li
After flush plug list, the list has no request, so we need to add a trace_block_plug(). Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16block: avoid unnecessary plug list flushShaohua Li
get_request_wait() could sleep and flush the plug list. If the list is already flushed, don't flush again. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-13block: Always check length of all iov entries in blk_rq_map_user_iov()Ben Hutchings
Even after commit 5478755616ae2ef1ce144dded589b62b2a50d575 ("block: check for proper length of iov entries earlier ...") we still won't check for zero-length entries after an unaligned entry. Remove the break-statement, so all entries are checked. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-10block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"Tejun Heo
This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10. The commit introduced alias for block devices which is intended to be used during logging although actual usage hasn't been committed yet. This approach adds very limited benefit (raw log might be easier to follow) which can be trivially implemented in userland but has a lot of problems. It is much worse than netif renames because it doesn't rename the actual device but just adds conveninence name which isn't used universally or enforced. Everything internal including device lookup and sysfs still uses the internal name and nothing prevents two devices from using conflicting alias - ie. sda can have sdb as its alias. This has been nacked by people working on device driver core, block layer and kernel-userland interface and shouldn't have been upstreamed. Revert it. http://thread.gmane.org/gmane.linux.kernel/1155104 http://thread.gmane.org/gmane.linux.scsi/68632 http://thread.gmane.org/gmane.linux.scsi/69776 Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-07Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-05Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits) virtio-blk: use ida to allocate disk index hpsa: add small delay when using PCI Power Management to reset for kump cciss: add small delay when using PCI Power Management to reset for kump xen/blkback: Fix two races in the handling of barrier requests. xen/blkback: Check for proper operation. xen/blkback: Fix the inhibition to map pages when discarding sector ranges. xen/blkback: Report VBD_WSECT (wr_sect) properly. xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests. xen-blkfront: plug device number leak in xlblk_init() error path xen-blkfront: If no barrier or flush is supported, use invalid operation. xen-blkback: use kzalloc() in favor of kmalloc()+memset() xen-blkback: fixed indentation and comments xen-blkfront: fix a deadlock while handling discard response xen-blkfront: Handle discard requests. xen-blkback: Implement discard requests ('feature-discard') xen-blkfront: add BLKIF_OP_DISCARD and discard request struct drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd() drivers/block/loop.c: emit uevent on auto release drivers/block/cpqarray.c: use pci_dev->revision loop: always allow userspace partitions and optionally support automatic scanning ... Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-05Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-11-03block: don't call blk_drain_queue() if elevator is not upTejun Heo
blk_cleanup_queue() may be called before elevator is set up on a queue which triggers the following oops. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590 Bochs Bochs RIP: 0010:[<ffffffff8125a69c>] [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Call Trace: [<ffffffff8125da92>] blk_drain_queue+0x42/0x70 [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0 [<ffffffff81469640>] md_free+0x50/0x70 [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0 [<ffffffff81270d56>] kref_put+0x36/0xa0 [<ffffffff8126f2b7>] kobject_put+0x27/0x60 [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40 [<ffffffff81083450>] process_one_work+0x100/0x3b0 [<ffffffff8108527f>] worker_thread+0x15f/0x3a0 [<ffffffff81089937>] kthread+0x87/0x90 [<ffffffff81621834>] kernel_thread_helper+0x4/0x10 Fix it by making blk_cleanup_queue() check whether q->elevator is set up before invoking blk_drain_queue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-31block: Change module.h -> export.h in bsg-lib.cPaul Gortmaker
This file isn't using full modular functionality, and hence can be "downgraded" to just using the export.h header. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31block: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macrosPaul Gortmaker
These files were getting <linux/module.h> via an implicit include path, but we want to crush those out of existence since they cost time during compiles of processing thousands of lines of headers for no reason. Give them the lightweight header that just contains the EXPORT_SYMBOL infrastructure. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (204 commits) [SCSI] qla4xxx: export address/port of connection (fix udev disk names) [SCSI] ipr: Fix BUG on adapter dump timeout [SCSI] megaraid_sas: Fix instance access in megasas_reset_timer [SCSI] hpsa: change confusing message to be more clear [SCSI] iscsi class: fix vlan configuration [SCSI] qla4xxx: fix data alignment and use nl helpers [SCSI] iscsi class: fix link local mispelling [SCSI] iscsi class: Replace iscsi_get_next_target_id with IDA [SCSI] aacraid: use lower snprintf() limit [SCSI] lpfc 8.3.27: Change driver version to 8.3.27 [SCSI] lpfc 8.3.27: T10 additions for SLI4 [SCSI] lpfc 8.3.27: Fix queue allocation failure recovery [SCSI] lpfc 8.3.27: Change algorithm for getting physical port name [SCSI] lpfc 8.3.27: Changed worst case mailbox timeout [SCSI] lpfc 8.3.27: Miscellanous logic and interface fixes [SCSI] megaraid_sas: Changelog and version update [SCSI] megaraid_sas: Add driver workaround for PERC5/1068 kdump kernel panic [SCSI] megaraid_sas: Add multiple MSI-X vector/multiple reply queue support [SCSI] megaraid_sas: Add support for MegaRAID 9360/9380 12GB/s controllers [SCSI] megaraid_sas: Clear FUSION_IN_RESET before enabling interrupts ...
2011-10-25blk-throttle: use queue_is_locked() instead of lockdep_is_held()Jens Axboe
We can't use the latter if !CONFIG_LOCKDEP. Reported-by: Sedat Dilek <sedat.dilek@googlemail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25blk-throttle: Take blkcg->lock while traversing blkcg->policy_listVivek Goyal
blkcg->policy_list is protected by blkcg->lock. Its not rcu protected list. So even for readers, they need to take blkcg->lock. There are few functions which were reading the list without taking lock. Fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25blk-throttle: Free up policy node associated with deleted ruleVivek Goyal
If a rule is being deleted, free up associated policy node. Otherwise that memory is leaked. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-25block: warn if tag is greater than real_max_depth.Tao Ma
In case tag depth is reduced, it is max_depth not real_max_depth. So we should allow a request with tag >= max_depth, but for a tag >= real_max_depth, there really should be some problem. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24Merge branch 'for-linus' into for-3.2/coreJens Axboe
2011-10-24block: make gendisk hold a reference to its queueTejun Heo
The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24blk-flush: move the queue kick intoJeff Moyer
A dm-multipath user reported[1] a problem when trying to boot a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4 (block: fix flush machinery for stacking drivers with differring flush flags) applied. It turns out that an empty flush request can be sent into blk_insert_flush. When the BUG_ON was fixed to allow for this, I/O on the underlying device would stall. The reason is that blk_insert_cloned_request does not kick the queue. In the aforementioned commit, I had added a special case to kick the queue if data was sent down but the queue flags did not require a flush. A better solution is to push the queue kick up into blk_insert_cloned_request. This patch, along with a follow-on which fixes the BUG_ON, fixes the issue reported. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Stable note: 3.1 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24blk-flush: fix invalid BUG_ON in blk_insert_flushJeff Moyer
A user reported a regression due to commit 4853abaae7e4a2af938115ce9071ef8684fb7af4 (block: fix flush machinery for stacking drivers with differring flush flags). Part of the problem is that blk_insert_flush required a single bio be attached to the request. In reality, having no attached bio is also a valid case, as can be observed with an empty flush. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com Acked-by: Tejun Heo <tj@kernel.org> Stable note: 3.1 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24block: Remove the control of complete cpu from bio.Tao Ma
bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24block: fix a typo in the blk-cgroup.h fileJie Liu
byptes -> bytes. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19block: fix request_queue lifetime handling by making blk_queue_cleanup() ↵Tejun Heo
properly shutdown request_queue is refcounted but actually depdends on lifetime management from the queue owner - on blk_cleanup_queue(), block layer expects that there's no request passing through request_queue and no new one will. This is fundamentally broken. The queue owner (e.g. SCSI layer) doesn't have a way to know whether there are other active users before calling blk_cleanup_queue() and other users (e.g. bsg) don't have any guarantee that the queue is and would stay valid while it's holding a reference. With delay added in blk_queue_bio() before queue_lock is grabbed, the following oops can be easily triggered when a device is removed with in-flight IOs. sd 0:0:1:0: [sdb] Stopping disk ata1.01: disabled general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100 ... Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80) ... Call Trace: [<ffffffff8137d774>] elv_merge+0x84/0xe0 [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400 [<ffffffff813838ea>] generic_make_request+0xca/0x100 [<ffffffff81383994>] submit_bio+0x74/0x100 [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0 [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40 [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60 [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff8118c1ca>] do_sync_read+0xda/0x120 [<ffffffff8118ce55>] vfs_read+0xc5/0x180 [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0 [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b This happens because blk_queue_cleanup() destroys the queue and elevator whether IOs are in progress or not and DEAD tests are sprinkled in the request processing path without proper synchronization. Similar problem exists for blk-throtl. On queue cleanup, blk-throtl is shutdown whether it has requests in it or not. Depending on timing, it either oopses or throttled bios are lost putting tasks which are waiting for bio completion into eternal D state. The way it should work is having the usual clear distinction between shutdown and release. Shutdown drains all currently pending requests, marks the queue dead, and performs partial teardown of the now unnecessary part of the queue. Even after shutdown is complete, reference holders are still allowed to issue requests to the queue although they will be immmediately failed. The rest of teardown happens on release. This patch makes the following changes to make blk_queue_cleanup() behave as proper shutdown. * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and queue_lock. * Unsynchronized DEAD check in generic_make_request_checks() removed. This couldn't make any meaningful difference as the queue could die after the check. * blk_drain_queue() updated such that it can drain all requests and is now called during cleanup. * blk_throtl updated such that it checks DEAD on grabbing queue_lock, drains all throttled bios during cleanup and free td when queue is released. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19block: drop @tsk from attempt_plug_merge() and explain sync rulesTejun Heo
attempt_plug_merge() accesses elevator without holding queue_lock and may call into ->elevator_bio_merge_fn(). The elvator is guaranteed to be valid because it's accessed iff the plugged list has requests and elevator is never exited with live requests, so as long as the elevator method can deal with unlocked access, this is safe. Explain the sync rules around attempt_plug_merge() and drop the unnecessary @tsk parameter. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>