summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2014-04-10softirq: Check preemption after reenabling interruptsThomas Gleixner
raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the rasie_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-04-10block: Use cpu_chill() for retry loopsThomas Gleixner
Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2014-04-10fs-block-rt-support.patchThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-10block: Shorten interrupt disabled regionsThomas Gleixner
Moving the blk_sched_flush_plug() call out of the interrupt/preempt disabled region in the scheduler allows us to replace local_irq_save/restore(flags) by local_irq_disable/enable() in blk_flush_plug(). Now instead of doing this we disable interrupts explicitely when we lock the request_queue and reenable them when we drop the lock. That allows interrupts to be handled when the plug list contains requests for more than one queue. Aside of that this change makes the scope of the irq disabled region more obvious. The current code confused the hell out of me when looking at: local_irq_save(flags); spin_lock(q->queue_lock); ... queue_unplugged(q...); scsi_request_fn(); spin_unlock(q->queue_lock); spin_lock(shost->host_lock); spin_unlock_irq(shost->host_lock); -------------------^^^ ???? spin_lock_irq(q->queue_lock); spin_unlock(q->lock); local_irq_restore(flags); Also add a comment to __blk_run_queue() documenting that q->request_fn() can drop q->queue_lock and reenable interrupts, but must return with q->queue_lock held and interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.025446432@linutronix.de
2014-02-22block: add cond_resched() to potentially long running ioctl discard loopJens Axboe
commit c8123f8c9cb517403b51aa41c3c46ff5e10b2c17 upstream. When mkfs issues a full device discard and the device only supports discards of a smallish size, we can loop in blkdev_issue_discard() for a long time. If preempt isn't enabled, this can turn into a softlock situation and the kernel will start complaining. Add an explicit cond_resched() at the end of the loop to avoid that. Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22block: __elv_next_request() shouldn't call into the elevator if bypassingTejun Heo
commit 556ee818c06f37b2e583af0363e6b16d0e0270de upstream. request_queue bypassing is used to suppress higher-level function of a request_queue so that they can be switched, reconfigured and shut down. A request_queue does the followings while bypassing. * bypasses elevator and io_cq association and queues requests directly to the FIFO dispatch queue. * bypasses block cgroup request_list lookup and always uses the root request_list. Once confirmed to be bypassing, specific elevator and block cgroup policy implementations can assume that nothing is in flight for them and perform various operations which would be dangerous otherwise. Such confirmation is acheived by short-circuiting all new requests directly to the dispatch queue and waiting for all the requests which were issued before to finish. Unfortunately, while the request allocating and draining sides were properly handled, we forgot to actually plug the request dispatch path. Even after bypassing mode is confirmed, if the attached driver tries to fetch a request and the dispatch queue is empty, __elv_next_request() would invoke the current elevator's elevator_dispatch_fn() callback. As all in-flight requests were drained, the elevator wouldn't contain any request but once bypass is confirmed we don't even know whether the elevator is even there. It might be in the process of being switched and half torn down. Frank Mayhar reports that this actually happened while switching elevators, leading to an oops. Let's fix it by making __elv_next_request() avoid invoking the elevator_dispatch_fn() callback if the queue is bypassing. It already avoids invoking the callback if the queue is dying. As a dying queue is guaranteed to be bypassing, we can simply replace blk_queue_dying() check with blk_queue_bypass(). Reported-by: Frank Mayhar <fmayhar@google.com> References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com Tested-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12Update of blkg_stat and blkg_rwstat may happen in bh context. While ↵Hong Zhiguo
u64_stats_fetch_retry is only preempt_disable on 32bit UP system. This is not enough to avoid preemption by bh and may read strange 64 bit value. commit 2c575026fae6e63771bd2a4c1d407214a8096a89 upstream. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08elevator: acquire q->sysfs_lock in elevator_change()Tomoki Sekiyama
commit 7c8a3679e3d8e9d92d58f282161760a0e247df97 upstream. Add locking of q->sysfs_lock into elevator_change() (an exported function) to ensure it is held to protect q->elevator from elevator_init(), even if elevator_change() is called from non-sysfs paths. sysfs path (elv_iosched_store) uses __elevator_change(), non-locking version, as the lock is already taken by elv_iosched_store(). Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08elevator: Fix a race in elevator switching and md device initializationTomoki Sekiyama
commit eb1c160b22655fd4ec44be732d6594fd1b1e44f4 upstream. The soft lockup below happens at the boot time of the system using dm multipath and the udev rules to switch scheduler. [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483] [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50 ... [ 356.127001] Call Trace: [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70 [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230 [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60 [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0 [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50 [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0 [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50 [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0 [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140 [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0 [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0 [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b This is caused by a race between md device initialization by multipathd and shell script to switch the scheduler using sysfs. - multipathd: SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init q->elevator = elevator_alloc(q, e); // not yet initialized - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler': elevator_switch (in the call trace above) struct elevator_queue *old = q->elevator; q->elevator = elevator_alloc(q, new_e); elevator_exit(old); // lockup! (*) - multipathd: (cont.) err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely while timer->base == NULL. In this case, as timer will never initialized, it results in lockup. This patch introduces acquisition of q->sysfs_lock around elevator_init() into blk_init_allocated_queue(), to provide mutual exclusion between initialization of the q->scheduler and switching of the scheduler. This should fix this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=902012 Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04blk-core: Fix memory corruption if blkcg_init_queue failsMikulas Patocka
commit fff4996b7db7955414ac74386efa5e07fd766b50 upstream. If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy to clean up structures allocated by the backing dev. ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0() ODEBUG: free active (active state 0) object type: percpu_counter hint: (null) Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix CPU: 0 PID: 2739 Comm: lvchange Tainted: G W 3.10.15-devel #14 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20 ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8 ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67 Call Trace: [<ffffffff813c8fd4>] dump_stack+0x19/0x1b [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0 [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50 [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250 [<ffffffff81229a15>] debug_print_object+0x85/0xa0 [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250 [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0 [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0 [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10 [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod] [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod] [<ffffffff81097662>] ? get_lock_stats+0x22/0x70 [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod] [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520 [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0 [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f ---[ end trace 4b5ff0d55673d986 ]--- ------------[ cut here ]------------ This fix should be backported to stable kernels starting with 2.6.37. Note that in the kernels prior to 3.5 the affected code is different, but the bug is still there - bdi_init is called and bdi_destroy isn't. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29block: properly stack underlying max_segment_size to DM deviceMike Snitzer
commit d82ae52e68892338068e7559a0c0657193341ce4 upstream. Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE (65536) even if the underlying device(s) have a larger value -- this is due to blk_stack_limits() using min_not_zero() when stacking the max_segment_size limit. 1073741824 before patch: 65536 after patch: 1073741824 Reported-by: Lukasz Flis <l.flis@cyfronet.pl> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29block: fix race between request completion and timeout handlingJeff Moyer
commit 4912aa6c11e6a5d910264deedbec2075c6f1bb73 upstream. crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461 RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP: 0018:ffff881057eefd60 EFLAGS: 00010012 RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8 RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780 RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338 R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000 FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540) Stack: 0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000 <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393 <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90 Call Trace: [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150 [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850 [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20 [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160 [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660 [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660 [<ffffffff810908c6>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff81090830>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20 Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0 RSP <ffff881057eefd60> The RIP is this line: BUG_ON(blk_queued_rq(rq)); After digging through the code, I think there may be a race between the request completion and the timer handler running. A timer is started for each request put on the device's queue (see blk_start_request->blk_add_timer). If the request does not complete before the timer expires, the timer handler (blk_rq_timed_out_timer) will mark the request complete atomically: static inline int blk_mark_rq_complete(struct request *rq) { return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags); } and then call blk_rq_timed_out. The latter function will call scsi_times_out, which will return one of BLK_EH_HANDLED, BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is returned, blk_clear_rq_complete is called, and blk_add_timer is again called to simply wait longer for the request to complete. Now, if the request happens to complete while this is going on, what happens? Given that we know the completion handler will bail if it finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion handler running after that bit is cleared. So, from the above paragraph, after the call to blk_clear_rq_complete. If the completion sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there (I haven't seen this in the cores). Next, if we get the completion before the call to list_add_tail, then the timer will eventually fire for an old req, which may either be freed or reallocated (there is evidence that this might be the case). Finally, if the completion comes in *after* the addition to the timeout list, I think it's harmless. The request will be removed from the timeout list, req_atom_complete will be set, and all will be well. This will only actually explain the coredumps *IF* the request structure was freed, reallocated *and* queued before the error handler thread had a chance to process it. That is possible, but it may make sense to keep digging for another race. I think that if this is what was happening, we would see other instances of this problem showing up as null pointer or garbage pointer dereferences, for example when the request structure was not re-used. It looks like we actually do run into that situation in other reports. This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); from blk_add_timer to the only caller that could trip over it (blk_start_request). It then inverts the calls to blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address the race. I've boot tested this patch, but nothing more. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-17block/partitions/efi.c: treat size mismatch as a warning, not an errorDoug Anderson
In commit 27a7c642174e ("partitions/efi: account for pmbr size in lba") we started treating bad sizes in lba field of the partition that has the 0xEE (GPT protective) as errors. However, we may run into these "bad sizes" in the real world if someone uses dd to copy an image from a smaller disk to a bigger disk. Since this case used to work (even without using force_gpt), keep it working and treat the size mismatch as a warning instead of an error. Reported-by: Josh Triplett <josh@joshtriplett.org> Reported-by: Sean Paul <seanpaul@chromium.org> Signed-off-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Tested-by: Artem Bityutskiy <dedekind1@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30block: change config option name for cmdline partition parsingPaul Gortmaker
Recently commit bab55417b10c ("block: support embedded device command line partition") introduced CONFIG_CMDLINE_PARSER. However, that name is too generic and sounds like it enables/disables generic kernel boot arg processing, when it really is block specific. Before this option becomes a part of a full/final release, add the BLK_ prefix to it so that it is clear in absence of any other context that it is block specific. In addition, fix up the following less critical items: - help text was not really at all helpful. - index file for Documentation was not updated - add the new arg to Documentation/kernel-parameters.txt - clarify wording in source comments Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Cai Zhiyong <caizhiyong@huawei.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-22Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO fixes from Jens Axboe: "After merge window, no new stuff this time only a collection of neatly confined and simple fixes" * 'for-3.12/core' of git://git.kernel.dk/linux-block: cfq: explicitly use 64bit divide operation for 64bit arguments block: Add nr_bios to block_rq_remap tracepoint If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. bio-integrity: Fix use of bs->bio_integrity_pool after free blkcg: relocate root_blkg setting and clearing block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) block: trace all devices plug operation
2013-09-22cfq: explicitly use 64bit divide operation for 64bit argumentsAnatol Pomozov
'samples' is 64bit operant, but do_div() second parameter is 32. do_div silently truncates high 32 bits and calculated result is invalid. In case if low 32bit of 'samples' are zeros then do_div() produces kernel crash. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-18If the queue is dying then we only call the rq->end_io callout.Mike Christie
This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. This patch has blk_execute_rq_nowait use __blk_end_request_all to free bios and also call rq->end_io. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-15partitions/efi: loosen check fot pmbr size in lbaDavidlohr Bueso
Matt found that commit 27a7c642174e ("partitions/efi: account for pmbr size in lba") caused his GPT formatted eMMC device not to boot. The reason is that this commit enforced Linux to always check the lesser of the whole disk or 2Tib for the pMBR size in LBA. While most disk partitioning tools out there create a pMBR with these characteristics, Microsoft does not, as it always sets the entry to the maximum 32-bit limitation - even though a drive may be smaller than that[1]. Loosen this check and only verify that the size is either the whole disk or 0xFFFFFFFF. No tool in its right mind would set it to any value other than these. [1] http://thestarman.pcministry.com/asm/mbr/GPT.htm#GPTPT Reported-and-tested-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11block/partitions/efi.c: consistently use pr_foo()Andrew Morton
Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Karel Zak <kzak@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: some style cleanupsDavidlohr Bueso
Trivial coding style cleanups - still plenty left. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: delete annoying emacs style commentsDavidlohr Bueso
I love emacs, but these settings for coding style are annoying when trying to open the efi.h file. More important, we already have checkpatch for that. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: compare first and last usable LBAsDavidlohr Bueso
When verifying GPT header integrity, make sure that first usable LBA is smaller than last usable LBA. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: account for pmbr size in lbaDavidlohr Bueso
The partition that has the 0xEE (GPT protective), must have the size in lba field set to the lesser of the size of the disk minus one or 0xFFFFFFFF for larger disks. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: detect hybrid MBRsDavidlohr Bueso
One of the biggest problems with GPT is compatibility with older, non-GPT systems. The problem is addressed by creating hybrid mbrs, an extension, or variant, of the traditional protective mbr. This contains, apart from the 0xEE partition, up three additional primary partitions that point to the same space marked by up to three GPT partitions. The result is that legacy OSs can see the three required MBR partitions and at the same time ignore the GPT-aware partitions that protect the GPT structures. While hybrid MBRs are hacks, workarounds and simply not part of the GPT standard, they do exist and we have no way around them. For instance, by default, OSX creates a hybrid scheme when using multi-OS booting. In order for Linux to properly discover protective MBRs, it must be made aware of devices that have hybrid MBRs. No functionality is changed by this patch, just a debug message informing the user of the MBR scheme that is being used. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: do not require gpt partition to begin at sector 1Davidlohr Bueso
When detecting a valid protective MBR, the Linux kernel isn't picky about the partition (1-4) the 0xEE is at, but, unlike other operating systems, it does require it to begin at the second sector (sector 1). This check, apart from it not being enforced by UEFI, and causing Linux to potentially fail to detect any *valid* partitions on the disk, can present problems when dealing with hybrid MBRs[1]. For compatibility reasons, if the first partition is hybridized, the 0xEE partition must be small enough to ensure that it only protects the GPT data structures - as opposed to the the whole disk in a protective MBR. This problem is very well described by Rod Smith[1]: where MBR-only partitioning programs (such as older versions of fdisk) can see some of the disk space as unallocated, thus loosing the purpose of the 0xEE partition's protection of GPT data structures. By dropping this check, this patch enables Linux to be more flexible when probing for GPT disklabels. [1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: check pmbr record's starting lbaDavidlohr Bueso
Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the LBA of the GPT Partition Header. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11partitions/efi: use lba-aware partition recordsDavidlohr Bueso
The kernel's GPT implementation currently uses the generic 'struct partition' type for dealing with legacy MBR partition records. While this is is useful for disklabels that we designed for CHS addressing, such as msdos, it doesn't adapt well to newer standards that use LBA instead, such as GUID partition tables. Furthermore, these generic partition structures do not have all the required fields to properly follow the UEFI specs. While a CHS address can be translated to LBA, it's much simpler and cleaner to just replace the partition type. This patch adds a new 'gpt_record' type that is fully compliant with EFI and will allow, in the next patches, to add more checks to properly verify a protective MBR, which is paramount to probing a device that makes use of GPT. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Karel Zak <kzak@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()Mathieu Desnoyers
I found the following pattern that leads in to interesting findings: grep -r "ret.*|=.*__put_user" * grep -r "ret.*|=.*__get_user" * grep -r "ret.*|=.*__copy" * The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat, since those appear in compat code, we could probably expect the kernel addresses not to be reachable in the lower 32-bit range, so I think they might not be exploitable. For the "__get_user" cases, I don't think those are exploitable: the worse that can happen is that the kernel will copy kernel memory into in-kernel buffers, and will fail immediately afterward. The alpha csum_partial_copy_from_user() seems to be missing the access_ok() check entirely. The fix is inspired from x86. This could lead to information leak on alpha. I also noticed that many architectures map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I wonder if the latter is performing the access checks on every architectures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11block: support embedded device command line partitionCai Zhiyong
Read block device partition table from command line. The partition used for fixed block device (eMMC) embedded device. It is no MBR, save storage space. Bootloader can be easily accessed by absolute address of data on the block device. Users can easily change the partition. This code reference MTD partition, source "drivers/mtd/cmdlinepart.c" About the partition verbose reference "Documentation/block/cmdline-partition.txt" [akpm@linux-foundation.org: fix printk text] [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11block/blk-sysfs.c: replace strict_strtoul() with kstrtoul()Jingoo Han
The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11blkcg: relocate root_blkg setting and clearingTejun Heo
Hello, Jens. The original thread can be read from http://thread.gmane.org/gmane.linux.kernel.cgroups/8937 While it leads to oops, given that it only triggers under specific configurations which aren't common. I don't think it's necessary to backport it through -stable and merging it during the coming merge window should be enough. Thanks! ----- 8< ----- Currently, q->root_blkg and q->root_rl.blkg are set from blkcg_activate_policy() and cleared from blkg_destroy_all(). This doesn't necessarily coincide with the lifetime of the root blkcg_gq leading to the following oops when blkcg is enabled but no policy is activated because __blk_queue_next_rl() malfunctions expecting the root_blkg pointers to be set. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90 PGD 60f7a9067 PUD 60f4c9067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC gsmi: Log Shutdown Reason 0x03 Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19 Hardware name: Intel XXX task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000 RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90 RSP: 0000:ffff88060d171818 EFLAGS: 00010096 RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60 RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003 FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0 Stack: 0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60 0000000000000082 0000000000000003 0000000000000000 0000000000000000 ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8 Call Trace: [<ffffffff810c7848>] __wake_up+0x48/0x70 [<ffffffff8131da53>] __blk_drain_queue+0x123/0x190 [<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210 [<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0 [<ffffffff81418824>] scsi_remove_device+0x34/0x50 [<ffffffff814189cb>] scsi_remove_target+0x16b/0x220 [<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0 [<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0 [<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60 [<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100 [<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0 [<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560 [<ffffffff81573335>] netlink_unicast+0x145/0x200 [<ffffffff815736f3>] netlink_sendmsg+0x303/0x410 [<ffffffff81528196>] sock_sendmsg+0xa6/0xd0 [<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0 [<ffffffff811ea840>] ? fget_light+0x40/0x160 [<ffffffff811ea899>] ? fget_light+0x99/0x160 [<ffffffff811ea840>] ? fget_light+0x40/0x160 [<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90 [<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20 [<ffffffff815fb642>] system_call_fastpath+0x16/0x1b Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49 Fix it by moving r->root_blkg and q->root_rl.blkg setting to blkg_create() and clearing to blkg_destroy() so that they area initialized when a root blkg is created and cleared when destroyed. Reported-and-tested-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-11block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)Joe Perches
Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-11block: trace all devices plug operationJianpeng Ma
In func blk_queue_bio, if list of plug is empty,it will call blk_trace_plug. If process deal with a single device,it't ok.But if process deal with multi devices,it only trace the first device. Using request_count to judge, it can soleve this problem. In addition, i modify the comment. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-04Merge branch 'for-3.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on the cgroup front. Most changes aren't visible to userland at all at this point and are laying foundation for the planned unified hierarchy. - The biggest change is decoupling the lifetime management of css (cgroup_subsys_state) from that of cgroup's. Because controllers (cpu, memory, block and so on) will need to be dynamically enabled and disabled, css which is the association point between a cgroup and a controller may come and go dynamically across the lifetime of a cgroup. Till now, css's were created when the associated cgroup was created and stayed till the cgroup got destroyed. Assumptions around this tight coupling permeated through cgroup core and controllers. These assumptions are gradually removed, which consists bulk of patches, and css destruction path is completely decoupled from cgroup destruction path. Note that decoupling of creation path is relatively easy on top of these changes and the patchset is pending for the next window. - cgroup has its own event mechanism cgroup.event_control, which is only used by memcg. It is overly complex trying to achieve high flexibility whose benefits seem dubious at best. Going forward, new events will simply generate file modified event and the existing mechanism is being made specific to memcg. This pull request contains prepatory patches for such change. - Various fixes and cleanups" Fixed up conflict in kernel/cgroup.c as per Tejun. * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits) cgroup: fix cgroup_css() invocation in css_from_id() cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp() cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup cgroup: implement CFTYPE_NO_PREFIX cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax cgroup: fix cgroup_write_event_control() cgroup: fix subsystem file accesses on the root cgroup cgroup: change cgroup_from_id() to css_from_id() cgroup: use css_get() in cgroup_create() to check CSS_ROOT cpuset: remove an unncessary forward declaration cgroup: RCU protect each cgroup_subsys_state release cgroup: move subsys file removal to kill_css() cgroup: factor out kill_css() cgroup: decouple cgroup_subsys_state destruction from cgroup destruction cgroup: replace cgroup->css_kill_cnt with ->nr_css cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item cgroup: move cgroup->subsys[] assignment to online_css() cgroup: reorganize css init / exit paths cgroup: add __rcu modifier to cgroup->subsys[] ...
2013-08-23[SCSI] Return ENODATA on medium errorHannes Reinecke
When a medium error is detected the SCSI stack should return ENODATA to the upper layers. [jejb: fix whitespace error] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-08-23[SCSI] return ENOSPC on thin provisioning failureHannes Reinecke
When the thin provisioning hard threshold is reached we should return ENOSPC to inform upper layers about this fact. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-08-09cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo
the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-09cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroupTejun Heo
cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cgroup_taskset which is used by the subsystem attach methods is the last cgroup subsystem API which isn't using css as the handle. Update cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp. The conversions are pretty mechanical. One exception is cpuset::cgroup_cs(), which lost its last user and got removed. This patch shouldn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-09cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo
cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
2013-08-09cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-09cgroup: add subsys backlink pointer to cftypeTejun Heo
cgroup is transitioning to using css (cgroup_subsys_state) instead of cgroup as the primary subsystem handle. The cgroupfs file interface will be converted to use css's which requires finding out the subsystem from cftype so that the matching css can be determined from the cgroup. This patch adds cftype->ss which points to the subsystem the file belongs to. The field is initialized while a cftype is being registered. This makes it unnecessary to explicitly specify the subsystem for other cftype handling functions. @ss argument dropped from various cftype handling functions. This patch shouldn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk>
2013-08-09cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methodsTejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-08-09cgroup: add css_parent()Tejun Heo
Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-09cgroup: add/update accessors which obtain subsys specific data from cssTejun Heo
css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-09cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/Tejun Heo
The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-14block: delete __cpuinit usage from all block filesPaul Gortmaker
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the drivers/block uses of the __cpuinit macros from all C files. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-11Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO updates from Jens Axboe: "Here are the core IO block bits for 3.11. It contains: - A tweak to the reserved tag logic from Jan, for weirdo devices with just 3 free tags. But for those it improves things substantially for random writes. - Periodic writeback fix from Jan. Marked for stable as well. - Fix for a race condition in IO scheduler switching from Jianpeng. - The hierarchical blk-cgroup support from Tejun. This is the grunt of the series. - blk-throttle fix from Vivek. Just a note that I'm in the middle of a relocation, whole family is flying out tomorrow. Hence I will be awal the remainder of this week, but back at work again on Monday the 15th. CC'ing Tejun, since any potential "surprises" will most likely be from the blk-cgroup work. But it's been brewing for a while and sitting in my tree and linux-next for a long time, so should be solid." * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits) elevator: Fix a race in elevator switching block: Reserve only one queue tag for sync IO if only 3 tags are available writeback: Fix periodic writeback after fs mount blk-throttle: implement proper hierarchy support blk-throttle: implement throtl_grp->has_rules[] blk-throttle: Account for child group's start time in parent while bio climbs up blk-throttle: add throtl_qnode for dispatch fairness blk-throttle: make throtl_pending_timer_fn() ready for hierarchy blk-throttle: make tg_dispatch_one_bio() ready for hierarchy blk-throttle: make blk_throtl_bio() ready for hierarchy blk-throttle: make blk_throtl_drain() ready for hierarchy blk-throttle: dispatch from throtl_pending_timer_fn() blk-throttle: implement dispatch looping blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() blk-throttle: add throtl_service_queue->parent_sq blk-throttle: generalize update_disptime optimization in blk_throtl_bio() blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] blk-throttle: move bio_lists[] and friends to throtl_service_queue ...
2013-07-09partitions/msdos: enumerate also AIX LVM partitionsPhilippe De Muyter
Graft AIX partitions enumeration into partitions/msdos.c There is already a AIX disks detection logic in msdos.c. When an AIX disk has been found, and if configured to, call the aix partitions recognizer. This avoids removal of AIX disks protection from msdos.c, avoids code duplication, and ensures that AIX partitions enumeration is called before plain msdos partitions enumeration. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09partitions: add aix lvm partition support filesPhilippe De Muyter
Add partitions/aix.h and partitions/aix.c. AIX LVM permits to make "logical volumes" which are made of multiple slices of multiple disks. The new code allows only access to the "logical volumes" which are made of one slice on the probed disk, a slice being a contiguous disk area. The code also detects "logical volumes" made of multiple slices on the probed disk, but can not describe them to the partition layer, because the partition layer generic code does not support that. When such non-contiguous "logical volumes" are detected, a diagnostic message is printed. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Karel Zak <kzak@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>