summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-01-30xen/blkfront: realloc ring info in blkif_resumeBob Liu
Need to reallocate ring info in the resume path, because info->rinfo was freed in blkif_free(). And 'multi-queue-max-queues' backend reports may have been changed. Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: Fix crash if backend doesn't follow the right states.Konrad Rzeszutek Wilk
We have split the setting up of all the resources in two steps: 1) talk_to_blkback - which figures out the num_ring_pages (from the default value of zero), sets up shadow and so 2) blkfront_connect - does the real part of filling out the internal structures. The problem is if we bypass the 1) step and go straight to 2) and call blkfront_setup_indirect where we use the macro BLK_RING_SIZE - which returns an negative value (because sz is zero - since num_ring_pages is zero - since it has never been set). We can fix this by making sure that we always have called talk_to_blkback before going to blkfront_connect. Or we could set in blkfront_probe info->nr_ring_pages = 1 to have a default value. But that looks odd - as we haven't actually negotiated any ring size. This patch changes XenbusStateConnected state to detect if we haven't done the initial handshake - and if so continue on as if were in XenbusStateInitWait state. We also roll the error recovery (freeing the structure) into talk_to_blkback error path - which is safe since that function is only called from blkback_changed. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: Fix two memory leaks.Bob Liu
This patch fixs two memleaks: backtrace: [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50 [<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0 [<ffffffff81534028>] xen_blkbk_probe+0x58/0x230 [<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130 [<ffffffff81511716>] driver_probe_device+0x166/0x2c0 [<ffffffff815119bc>] __device_attach_driver+0xac/0xb0 [<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90 [<ffffffff81511ab7>] __device_attach+0xc7/0x120 [<ffffffff81511b23>] device_initial_probe+0x13/0x20 [<ffffffff8151059a>] bus_probe_device+0x9a/0xb0 [<ffffffff8150f0a1>] device_add+0x3b1/0x5c0 [<ffffffff8150f47e>] device_register+0x1e/0x30 [<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170 [<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0 [<ffffffff8146b1bb>] backend_changed+0x1b/0x20 [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160 unreferenced object 0xffff880007ba8ef8 (size 224): backtrace: [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50 [<ffffffff81205c73>] __kmalloc+0xd3/0x1e0 [<ffffffff81534d87>] frontend_changed+0x2c7/0x580 [<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0 [<ffffffff8146b2c0>] frontend_changed+0x10/0x20 [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160 [<ffffffff810d3e97>] kthread+0xd7/0xf0 [<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800048dcd38 (size 224): The first leak is caused by not put() the be->blkif reference which we had gotten in xen_blkif_alloc(), while the second is us not freeing blkif->rings in the right place. Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: make st_ statistics per ringBob Liu
Make st_* statistics per ring and the VBD sysfs would iterate over all the rings. Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings are torn down, so it's safe. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Aligned the variables on the same column.
2016-01-04xen/blkfront: Handle non-indirect grant with 64KB pagesJulien Grall
The minimal size of request in the block framework is always PAGE_SIZE. It means that when 64KB guest is support, the request will at least be 64KB. Although, if the backend doesn't support indirect descriptor (such as QDISK in QEMU), a ring request is only able to accommodate 11 segments of 4KB (i.e 44KB). The current frontend is assuming that an I/O request will always fit in a ring request. This is not true any more when using 64KB page granularity and will therefore crash during boot. On ARM64, the ABI is completely neutral to the page granularity used by the domU. The guest has the choice between different page granularity supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). This can't be enforced by the hypervisor and therefore it's possible to run guests using different page granularity. So we can't mandate the block backend to support indirect descriptor when the frontend is using 64KB page granularity and have to fix it properly in the frontend. The solution exposed below is based on modifying directly the frontend guest rather than asking the block framework to support smaller size (i.e < PAGE_SIZE). This is because the change is the block framework are not trivial as everything seems to relying on a struct *page (see [1]). Although, it may be possible that someone succeed to do it in the future and we would therefore be able to use it. Given that a block request may not fit in a single ring request, a second request is introduced for the data that cannot fit in the first one. This means that the second ring request should never be used on Linux if the page size is smaller than 44KB. To achieve the support of the extra ring request, the block queue size is divided by two. Therefore, the ring will always contain enough space to accommodate 2 ring requests. While this will reduce the overall performance, it will make the implementation more contained. The way forward to get better performance is to implement in the backend either indirect descriptor or multiple grants ring. Note that the parameters blk_queue_max_* helpers haven't been updated. The block code will set the mimimum size supported and we may be able to support directly any change in the block framework that lower down the minimal size of a request. [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen-blkfront: Introduce blkif_ring_get_requestJulien Grall
The code to get a request is always the same. Therefore we can factorize it in a single function. Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()Jiri Kosina
xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of every attempt to purge the LRU. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark xen_blkif_schedule() freezable (as it can generate I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: Free resources if connect_ring failed.Konrad Rzeszutek Wilk
With the multi-queue support we could fail at setting up some of the rings and fail the connection. That meant that all resources tied to rings[0..n-1] (where n is the ring that failed to be setup). Eventually the frontend will switch to the states and we will call xen_blkif_disconnect. However we do not want to be at the mercy of the frontend deciding when to change states. This allows us to do the cleanup right away and freeing resources. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blocks: Return -EXX instead of -1Konrad Rzeszutek Wilk
Lets return sensible values instead of -1. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: make pool of persistent grants and free pages per-queueBob Liu
Make pool of persistent grants and free pages per-queue/ring instead of per-device to get better scalability. Test was done based on null_blk driver: dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk" domu: v4.2-rc8 16vcpus 10GB [test] rw=read direct=1 ioengine=libaio bs=4k time_based runtime=30 filename=/dev/xvdb numjobs=16 iodepth=64 iodepth_batch=64 iodepth_batch_complete=64 group_reporting Results: iops1: After patch "xen/blkfront: make persistent grants per-queue". iops2: After this patch. Queues: 1 4 8 16 Iops orig(k): 810 1064 780 700 Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%) Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%) With 4 queues after this commit we can get ~75% increase in IOPS, and performance won't drop if increasing queue numbers. Please find the respective chart in this link: https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0 Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: get the number of hardware queues/rings from blkfrontBob Liu
Backend advertises "multi-queue-max-queues" to front, also get the negotiated number from "multi-queue-num-queues" written by blkfront. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkback: pseudo support for multi hardware queues/ringsKonrad Rzeszutek Wilk
Preparatory patch for multiple hardware queues (rings). The number of rings is unconditionally set to 1, larger number will be enabled in "xen/blkback: get the number of hardware queues/rings from blkfront". Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Align variables in the structures.
2016-01-04xen/blkback: separate ring information out of struct xen_blkifBob Liu
Split per ring information to an new structure "xen_blkif_ring", so that one vbd device can be associated with one or more rings/hardware queues. Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we may have multi backend threads. This patch is a preparation for supporting multi hardware queues/rings. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Align the variables in the structure.
2016-01-04xen/blkfront: correct setting for xen_blkif_max_ring_orderPeng Fan
According to this piece code: " pr_info("Invalid max_ring_order (%d), will use default max: %d.\n", xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER); " if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER, need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER, but not 0. Signed-off-by: Peng Fan <van.freenix@gmail.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: make persistent grants pool per-queueBob Liu
Make persistent grants per-queue/ring instead of per-device, so that we can drop the 'dev_lock' and get better scalability. Test was done based on null_blk driver: dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk" domu: v4.2-rc8 16vcpus 10GB [test] rw=read direct=1 ioengine=libaio bs=4k time_based runtime=30 filename=/dev/xvdb numjobs=16 iodepth=64 iodepth_batch=64 iodepth_batch_complete=64 group_reporting Queues: 1 4 8 16 Iops orig(k): 810 1064 780 700 Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%) Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: Remove duplicate setting of ->xbdev.Bob Liu
We do the same exact operations a bit earlier in the function. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.Konrad Rzeszutek Wilk
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: negotiate number of queues/rings to be used with backendBob Liu
The max number of hardware queues for xen/blkfront is set by parameter 'max_queues'(default 4), while it is also capped by the max value that the xen/blkback exposes through XenStore key 'multi-queue-max-queues'. The negotiated number is the smaller one and would be written back to xenstore as "multi-queue-num-queues", blkback needs to read this negotiated number. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: split per device io_lockBob Liu
After patch "xen/blkfront: separate per ring information out of device info", per-ring data is protected by a per-device lock ('io_lock'). This is not a good way and will effect the scalability, so introduce a per-ring lock ('ring_lock'). The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and ->persistent_gnts_c which are shared by all rings. Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc so setting ->persistent_gnts_c to zero is not needed. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: pseudo support for multi hardware queues/ringsBob Liu
Preparatory patch for multiple hardware queues (rings). The number of rings is unconditionally set to 1, larger number will be enabled in patch "xen/blkfront: negotiate number of queues/rings to be used with backend" so as to make review easier. Note that blkfront_gather_backend_features does not call blkfront_setup_indirect anymore (as that needs to be done per ring). That means that in blkif_recover/blkif_connect we have to do it in a loop (bounded by nr_rings). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkfront: separate per ring information out of device infoBob Liu
Split per ring information to a new structure "blkfront_ring_info". A ring is the representation of a hardware queue, every vbd device can associate with one or more rings depending on how many hardware queues/rings to be used. This patch is a preparation for supporting real multi hardware queues/rings. We also add a backpointer to 'struct blkfront_info' (dev_info) which is not needed (we could use containers_of) but further patch ("xen/blkfront: pseudo support for multi hardware queues/rings") will make allocation of 'blkfront_ring_info' dynamic. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-01-04xen/blkif: document blkif multi-queue/ring extensionBob Liu
Document the multi-queue/ring feature in terms of XenStore keys to be written by the backend and by the frontend. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-31bcache: Change refill_dirty() to always scan entire disk if necessaryKent Overstreet
Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: prevent crash on changing writeback_runningStefan Bader
Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: allows use of register in udev to avoid "device_busy" error.Gabriel de Perthuis
Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: unregister reboot notifier if bcache fails to unregister deviceZheng Liu
In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: fix a leak in bch_cached_dev_run()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing deviceZheng Liu
This bug can be reproduced by the following script: #!/bin/bash bcache_sysfs="/sys/fs/bcache" function clear_cache() { if [ ! -e $bcache_sysfs ]; then echo "no bcache sysfs" exit fi cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}') sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach" sleep 5 sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach" } for ((i=0;i<10;i++)); do clear_cache done The warning messages look like below: [ 275.948611] ------------[ cut here ]------------ [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- ) [ 275.979253] Hardware name: Tecal RH2285 [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache' [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.089315] Call Trace: [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0 [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170 [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20 [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache] [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]--- [ 276.338241] ------------[ cut here ]------------ [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- ) [ 276.386017] Hardware name: Tecal RH2285 [ 276.401430] Couldn't create device <-> cache set symlinks [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.482169] Call Trace: [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache] [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]--- We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach() function when we attach a backing device first time. After detaching this backing device, this flag will be true and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when we attach this backing device again, sysfs_create_link() will return EEXIST error in bcache_device_link(). So the fix is trival and we clear this flag in bcache_device_link(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: Add a cond_resched() call to gcKent Overstreet
Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-31bcache: fix a livelock when we cause a huge number of cache missesZheng Liu
Subject : [PATCH v2] bcache: fix a livelock in btree lock Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM) This commit tries to fix a livelock in bcache. This livelock might happen when we causes a huge number of cache misses simultaneously. When we get a cache miss, bcache will execute the following path. ->cached_dev_make_request() ->cached_dev_read() ->cached_lookup() ->bch->btree_map_keys() ->btree_root() <------------------------ ->bch_btree_map_keys_recurse() | ->cache_lookup_fn() | ->cached_dev_cache_miss() | ->bch_btree_insert_check_key() -| [If btree->seq is not equal to seq + 1, we should return EINTR and traverse btree again.] In bch_btree_insert_check_key() function we first need to check upgrade flag (op->lock == -1), and when this flag is true we need to release read btree->lock and try to take write btree->lock. During taking and releasing this write lock, btree->seq will be monotone increased in order to prevent other threads modify this in cache miss (see btree.h:74). But if there are some cache misses caused by some requested, we could meet a livelock because btree->seq is always changed by others. Thus no one can make progress. This commit will try to take write btree->lock if it encounters a race when we traverse btree. Although it sacrifice the scalability but we can ensure that only one can modify the btree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-23sx8: use real time for the command secondsJens Axboe
Commit 8182503df1ba used monotonic time, but if the adapter is using the seconds for logging entries, then we'll get duplicate entries if the system is rebooted. Use real time instead. Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 8182503df1ba ("block: sx8.c: Replace timeval with ktime_t") Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22block: sx8.c: Replace timeval with ktime_tShraddha Barke
32-bit systems using 'struct timeval' will break in the year 2038, in order to avoid that replace the code with more appropriate types. This patch replaces timeval with 64 bit ktime_t which is y2038 safe. Since st->timestamp is only interested in seconds, directly using time64_t here. Function ktime_get_seconds is used since it uses monotonic instead of real time and thus will not cause overflow. Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: fix error path during resizeLars Ellenberg
In case the lower level device size changed, but some other internal details of the resize did not work out, drbd_determine_dev_size() would try to restore the previous settings, trusting drbd_md_set_sector_offsets() to "do the right thing", but overlooked that this internally may set the meta data base offset based on device size. This could end up with incomplete on-disk meta data layout change, and ultimately lead to data corruption (if the failure was not noticed or ignored by the operator, and other things go wrong as well). Just remember all meta data related offsets/sizes, and on error restore them all. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: avoid potential deadlock during handshakeLars Ellenberg
During handshake communication, we also reconsider our device size, using drbd_determine_dev_size(). Just in case we need to change the offsets or layout of our on-disk metadata, we lock out application and other meta data IO, and wait for the activity log to be "idle" (no more referenced extents). If this handshake happens just after a connection loss, with a fencing policy of "resource-and-stonith", we have frozen IO. If, additionally, the activity log was "starving" (too many incoming random writes at that point in time), it won't become idle, ever, because of the frozen IO, and this would be a lockup of the receiver thread, and consquentially of DRBD. Previous logic (re-)initialized with a special "empty" transaction block, which required the activity log to fully drain first. Instead, write out some standard activity log transactions. Using lc_try_lock_for_transaction() instead of lc_try_lock() does not care about pending activity log references, avoiding the potential deadlock. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: separate out __al_write_transaction helper functionLars Ellenberg
To be able to "force out" an activity log transaction, even if there are no pending updates. This will be used to relocate the on-disk activity log, if the on-disk offsets have to be changed, without the need to empty the activity log first. While at it, move the definition, so we can drop the forward declaration of a static helper. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: make suspend_io() / resume_io() must be thread and recursion safePhilipp Reisner
Avoid to prematurely resume application IO: don't set/clear a single bit, but inc/dec an atomic counter. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: fix "endless" transfer log walk in protocol ALars Ellenberg
Don't remember a DRBD request as ack_pending, if it is not. In protocol A, we usually clear RQ_NET_PENDING at the same time we set RQ_NET_SENT, so when deciding to remember it as ack_pending, mod_rq_state needs to look at the current request state, not at the previous state before the current modification was applied. This should prevent advance_conn_req_ack_pending() from walking the full transfer log just to find NULL in protocol A, which would cause serious performance degradation with many "in-flight" requests, e.g. when working via DRBD-proxy, or with a huge bandwidth-delay product. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: fix memory leak in drbd_adm_resizeOleg Drokin
new_disk_conf could be leaked if the follow on checks fail, so make sure to free it on error if it was not assigned yet. Found with smatch. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: don't block forever in disconnect during resync if fencing=r-a-stonithLars Ellenberg
Disconnect should wait for pending bitmap IO. But if that bitmap IO is not happening, because it is waiting for pending application IO, and there is no progress, because the fencing policy suspended application IO because of the disconnect, then we deadlock. The bitmap writeout in this case does not care for concurrent application IO, so there is no point waiting for it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25lru_cache: Converted lc_seq_printf_status to return voidRoland Kammerer
Fix the semantic of lc_seq_printf. Currently, it always returns 0 and the return value is unused, therefore, convert the return type to void. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: make drbd known to lsblk: use bd_link_disk_holderLars Ellenberg
lsblk should be able to pick up stacking device driver relations involving DRBD conveniently. Even though upstream kernel since 2011 says "DON'T USE THIS UNLESS YOU'RE ALREADY USING IT." a new user has been added since (bcache), which sets the precedences for us to use it as well. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: fix queue limit setup for discardLars Ellenberg
We cannot possibly support SECDISCARD, even if all backend devices would support it: if our peer is currently unreachable, some instance of the data may obviously still be recoverable. We did not set discard_granularity at all. We don't really care (yet), we only pass them on, so for now, set our granularity to one sector. blkdev_stack_limits() takes care of the rest. If we decide we cannot support discards, not only clear the (not user visible) QUEUE_FLAG_DISCARD, but set both (user visible) discard_granularity and max_discard_sectors to zero, to avoid confusion with e.g. lsblk -D. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: fix spurious alert level printkLars Ellenberg
When accessing out meta data area on disk, we double check the plausibility of the requested sector offsets, and are very noisy about it if they look suspicious. During initial read of our "superblock", for "external" meta data, this triggered because the range estimate returned by drbd_md_last_sector() was still wrong. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: use bitmap_weight() helper, don't open codeLars Ellenberg
Suggested by Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: avoid redefinition of BITS_PER_PAGELars Ellenberg
Apparently we now implicitly get definitions for BITS_PER_PAGE and BITS_PER_PAGE_MASK from the pid_namespace.h Instead of renaming our defines, I chose to define only if not yet defined, but to double check the value if already defined. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: use resource name in workqueueLars Ellenberg
Since kernel 3.3, we can use snprintf-style arguments to create a workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: debugfs: expose ed_data_gen_idLars Ellenberg
The effective data generation ID may be interesting for debugging purposes of scenarios involving diskless states. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: prevent NULL pointer deref when resuming diskless primaryLars Ellenberg
In a multiple error scenario, we may end up with a "frozen" Primary, that has no access to any data (no local disk, no replication link). If we then resume-io, we try to generate a new data generation id, which will fail if there is no longer a local disk. Double check for available local data, which prevents the NULL pointer deref. If we are diskless, turn the resume-io in this situation into the first stage of a "force down", by bumping the "effective" data gen id, which will prevent later attach or connect to the former data set without first being demoted (deconfigured). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: Create a dedicated workqueue for sending acks on the control connectionPhilipp Reisner
The intention is to reduce CPU utilization. Recent measurements unveiled that the current performance bottleneck is CPU utilization on the receiving node. The asender thread became CPU limited. One of the main points is to eliminate the idr_for_each_entry() loop from the sending acks code path. One exception in that is sending back ping_acks. These stay in the ack-receiver thread. Otherwise the logic becomes too complicated for no added value. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-25drbd: Rename asender to ack_receiverPhilipp Reisner
This prepares the next patch where the sending on the meta (or control) socket is moved to a dedicated workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>