summaryrefslogtreecommitdiff
path: root/fs/btrfs/volumes.c
AgeCommit message (Collapse)Author
2016-06-25Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have a two part pull this time because one of the patches Dave Sterba collected needed to be against v4.7-rc2 or higher (we used rc4). I try to make my for-linus-xx branch testable on top of the last major so we can hand fixes to people on the list more easily, so I've split this pull in two. This first part has some fixes and two performance improvements that we've been testing for some time. Josef's two performance fixes are most notable. The transid tracking patch makes a big improvement on pretty much every workload" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: Force stripesize to the value of sectorsize btrfs: fix disk_i_size update bug when fallocate() fails Btrfs: fix error handling in map_private_extent_buffer Btrfs: fix error return code in btrfs_init_test_fs() Btrfs: don't do nocow check unless we have to btrfs: fix deadlock in delayed_ref_async_start Btrfs: track transid for delayed ref flushing
2016-06-23Btrfs: Force stripesize to the value of sectorsizeChandan Rajendra
Btrfs code currently assumes stripesize to be same as sectorsize. However Btrfs-progs (until commit df05c7ed455f519e6e15e46196392e4757257305) has been setting btrfs_super_block->stripesize to a value of 4096. This commit makes sure that the value of btrfs_super_block->stripesize is a power of 2. Later, it unconditionally sets btrfs_root->stripesize to sectorsize. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-18Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The most user visible change here is a fix for our recent superblock validation checks that were causing problems on non-4k pagesized systems" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize btrfs: remove build fixup for qgroup_account_snapshot btrfs: use new error message helper in qgroup_account_snapshot btrfs: avoid blocking open_ctree from cleaner_kthread Btrfs: don't BUG_ON() in btrfs_orphan_add btrfs: account for non-CoW'd blocks in btrfs_abort_transaction Btrfs: check if extent buffer is aligned to sectorsize btrfs: Use correct format specifier
2016-06-17Btrfs: check if extent buffer is aligned to sectorsizeLiu Bo
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-10Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Has some fixes and some new self tests for btrfs. The self tests are usually disabled in the .config file (unless you're doing btrfs dev work), and this bunch is meant to find problems with the 64K page size patches. Jeff has a patch to help people see if they are using the hardware assist crc32c module, which really helps us nail down problems when people ask why crcs are using so much CPU. Otherwise, it's small fixes" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize Btrfs: self-tests: Use macros instead of constants and add missing newline Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE btrfs: advertise which crc32c implementation is being used at module load Btrfs: add validadtion checks for chunk loading Btrfs: add more validation checks for superblock Btrfs: clear uptodate flags of pages in sys_array eb Btrfs: self-tests: Support non-4k page size Btrfs: Fix integer overflow when calculating bytes_per_bitmap Btrfs: test_check_exists: Fix infinite loop when searching for free space entries Btrfs: end transaction if we abort when creating uuid root btrfs: Use __u64 in exported linux/btrfs.h.
2016-06-08Merge branch 'misc-fixes-4.7' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
2016-06-06Btrfs: add validadtion checks for chunk loadingLiu Bo
To prevent fuzzed filesystem images from panic the whole system, we need various validation checks to refuse to mount such an image if btrfs finds any invalid value during loading chunks, including both sys_array and regular chunks. Note that these checks may not be sufficient to cover all corner cases, feel free to add more checks. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: add more validation checks for superblockLiu Bo
This adds validation checks for super_total_bytes, super_bytes_used and super_stripesize, super_num_devices. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: clear uptodate flags of pages in sys_array ebLiu Bo
We set uptodate flag to pages in the temporary sys_array eb, but do not clear the flag after free eb. As the special btree inode may still hold a reference on those pages, the uptodate flag can remain alive in them. If btrfs_super_chunk_root has been intentionally changed to the offset of this sys_array eb, reading chunk_root will read content of sys_array and it will skip our beautiful checks in btree_readpage_end_io_hook() because of "pages of eb are uptodate => eb is uptodate" This adds the 'clear uptodate' part to force it to read from disk. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-04Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The important part of this pull is Filipe's set of fixes for btrfs device replacement. Filipe fixed a few issues seen on the list and a number he found on his own" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent Btrfs: fix race between device replace and read repair Btrfs: fix race between device replace and discard Btrfs: fix race between device replace and chunk allocation Btrfs: fix race setting block group back to RW mode during device replace Btrfs: fix unprotected assignment of the left cursor for device replace Btrfs: fix race setting block group readonly during device replace Btrfs: fix race between device replace and block group removal Btrfs: fix race between readahead and device replace/removal
2016-05-31Btrfs: end transaction if we abort when creating uuid rootJosef Bacik
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction, otherwise we hang and make me super grumpy. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-30Btrfs: fix race between device replace and chunk allocationFilipe Manana
While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time. However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start dellaloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater then or equals to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device. For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout: [ extent bg A ] [ hole, unallocated space ] [extent bg B ] 3Gb 4Gb 5Gb While we are iterating the device extents from the source device using the commit root of the device tree, the following happens: CPU 1 CPU 2 <we are at transaction N> scrub_enumerate_chunks() --> searches the device tree for extents belonging to the source device using the device tree's commit root --> 1st iteration finds extent belonging to block group A --> sets block group A to RO mode (btrfs_inc_block_group_ro) --> sets cursor left to found_key.offset which is 3Gb --> scrub_chunk() starts copies all allocated extents from block group's A stripe at source device into target device btrfs_alloc_chunk() --> allocates device extent in the range [4Gb, 5Gb[ from the source device for a new block group C extent allocated from block group C for a direct IO, buffered write or btree node/leaf extent is written to, perhaps in response to a writepages() call from the VM or directly through direct IO the write is made only against the source device and not against the target device because the extent's offset is in the interval [4Gb, 5Gb[ which is larger then the value of cursor_left (3Gb) --> scrub_chunks() finishes --> updates left cursor from 3Gb to 4Gb --> btrfs_dec_block_group_ro() sets block group A back to RW mode <we are still at transaction N> --> 2nd iteration finds extent belonging to block group B - it did not find the new extent in the range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix race between device replace and block group removalFilipe Manana
When it's finishing, the device replace code iterates all extent maps representing block group and for each one that has a stripe that refers to the source device, it replaces its device with the target device. However when it replaces the source device with the target device it, the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID), only after its ID is changed to match the one from the source device. This leads to races with the chunk removal code that can temporarly see a device with an ID of 0ULL and then attempt to use that ID to remove items from the device tree and fail, causing a transaction abort: [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished [ 9238.594377] ------------[ cut here ]------------ [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594403] BTRFS: Transaction aborted (error 1) [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl oppy [last unloaded: btrfs] [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0 [ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18 [ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0 [ 9238.594424] Call Trace: [ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd [ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53 [ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188 [ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs] [ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs] [ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs] [ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc [ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 [ 9238.594496] ---[ end trace 183efbe50275f059 ]--- The sequence of steps leading to this is like the following: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_delete_unused_bgs() btrfs_remove_chunk() looks up for the extent map corresponding to the chunk lock_chunks() (chunk_mutex) check_system_chunk() unlock_chunks() (chunk_mutex) locks fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) iterates over all stripes from the extent map --> calls btrfs_free_dev_extent() passing it the target device that still has an ID of 0ULL --> btrfs_free_dev_extent() fails --> aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid (which is necessarily > 0) frees the srcdev unlocks fs_info->chunk_mutex So fix this by taking the device list mutex while processing the stripes for the chunk's extent map. This is similar to the race between device replace and block group creation that was fixed by commit 50460e37186a ("Btrfs: fix race when finishing dev replace leading to transaction abort"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-27Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs cleanups and fixes from Chris Mason: "We have another round of fixes and a few cleanups. I have a fix for short returns from btrfs_copy_from_user, which finally nails down a very hard to find regression we added in v4.6. Dave is pushing around gfp parameters, mostly to cleanup internal apis and make it a little more consistent. The rest are smaller fixes, and one speelling fixup patch" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits) Btrfs: fix handling of faults from btrfs_copy_from_user btrfs: fix string and comment grammatical issues and typos btrfs: scrub: Set bbio to NULL before calling btrfs_map_block Btrfs: fix unexpected return value of fiemap Btrfs: free sys_array eb as soon as possible btrfs: sink gfp parameter to convert_extent_bit btrfs: make state preallocation more speculative in __set_extent_bit btrfs: untangle gotos a bit in convert_extent_bit btrfs: untangle gotos a bit in __clear_extent_bit btrfs: untangle gotos a bit in __set_extent_bit btrfs: sink gfp parameter to set_record_extent_bits btrfs: sink gfp parameter to set_extent_new btrfs: sink gfp parameter to set_extent_defrag btrfs: sink gfp parameter to set_extent_delalloc btrfs: sink gfp parameter to clear_extent_dirty btrfs: sink gfp parameter to clear_record_extent_bits btrfs: sink gfp parameter to clear_extent_bits btrfs: sink gfp parameter to set_extent_bits btrfs: make find_workspace warn if there are no workspaces btrfs: make find_workspace always succeed ...
2016-05-25Merge branch 'cleanups-4.7' into for-chris-4.7-20160525David Sterba
2016-05-25btrfs: fix string and comment grammatical issues and typosNicholas D Steeves
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25Btrfs: free sys_array eb as soon as possibleLiu Bo
While reading sys_chunk_array in superblock, btrfs creates a temporary extent buffer. Since we don't use it after finishing reading sys_chunk_array, we don't need to keep it in memory. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-21Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has our merge window series of cleanups and fixes. These target a wide range of issues, but do include some important fixes for qgroups, O_DIRECT, and fsync handling. Jeff Mahoney moved around a few definitions to make them easier for userland to consume. Also whiteout support is included now that issues with overlayfs have been cleared up. I have one more fix pending for page faults during btrfs_copy_from_user, but I wanted to get this bulk out the door first" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits) btrfs: fix memory leak during RAID 5/6 device replacement Btrfs: add semaphore to synchronize direct IO writes with fsync Btrfs: fix race between block group relocation and nocow writes Btrfs: fix race between fsync and direct IO writes for prealloc extents Btrfs: fix number of transaction units for renames with whiteout Btrfs: pin logs earlier when doing a rename exchange operation Btrfs: unpin logs if rename exchange operation fails Btrfs: fix inode leak on failure to setup whiteout inode in rename btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT Btrfs: pin log earlier when renaming Btrfs: unpin log if rename operation fails Btrfs: don't do unnecessary delalloc flushes when relocating Btrfs: don't wait for unrelated IO to finish before relocation Btrfs: fix empty symlink after creating symlink and fsync parent dir Btrfs: fix for incorrect directory entries after fsync log replay btrfs: build fixup for qgroup_account_snapshot btrfs: qgroup: Fix qgroup accounting when creating snapshot Btrfs: fix fspath error deallocation btrfs: make find_workspace warn if there are no workspaces btrfs: make find_workspace always succeed ...
2016-05-21lib/uuid.c: move generate_random_uuid() to uuid.cAndy Shevchenko
Let's gather the UUID related functions under one hood. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-16Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516David Sterba
2016-05-16Merge branch 'cleanups-4.7' into for-chris-4.7-20160516David Sterba
2016-05-06btrfs: fix lock dep warning move scratch super outside of chunk_mutexAnand Jain
Move scratch super outside of the chunk lock to avoid below lockdep warning. The better place to scratch super is in the function btrfs_rm_dev_replace_free_srcdev() just before free_device, which is outside of the chunk lock as well. To reproduce: (fresh boot) mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde mount /dev/sdc /btrfs dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 (get devmgt from https://github.com/asj/devmgt.git) devmgt detach /dev/sde dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 sync btrfs replace start -Brf 3 /dev/sdf /btrfs <-- devmgt attach host7 ====================================================== [ INFO: possible circular locking dependency detected ] 4.6.0-rc2asj+ #1 Not tainted --------------------------------------------------- btrfs/2174 is trying to acquire lock: (sb_writers){.+.+.+}, at: [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 but task is already holding lock: (&fs_info->chunk_mutex){+.+.+.}, at: [<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs] which lock already depends on the new lock. Chain exists of: sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(sb_writers); *** DEADLOCK *** -> #0 (sb_writers){.+.+.+}: [<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0 [<ffffffff810e707e>] lock_acquire+0xbe/0x210 [<ffffffff810df49a>] percpu_down_read+0x4a/0xa0 [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 [<ffffffff81265534>] mnt_want_write+0x24/0x50 [<ffffffff812508a2>] path_openat+0x952/0x1190 [<ffffffff81252451>] do_filp_open+0x91/0x100 [<ffffffff8123f5cc>] file_open_name+0xfc/0x140 [<ffffffff8123f643>] filp_open+0x33/0x60 [<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs] [<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs] [<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs] [<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs] [<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs] Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06Btrfs: remove BUG_ON()'s in btrfs_map_blockJosef Bacik
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree to not be assholes and panic at any possible chance things are all fucked up. Signed-off-by: Josef Bacik <jbacik@fb.com> [ removed type casts ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06Btrfs: fix divide error upon chunk's stripe_lenLiu Bo
The struct 'map_lookup' uses type int for @stripe_len, while btrfs_chunk_stripe_len() can return a u64 value, and it may end up with @stripe_len being undefined value and it can lead to 'divide error' in __btrfs_map_block(). This changes 'map_lookup' to use type u64 for stripe_len, also right now we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for BTRFS_STRIPE_LEN. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ folded division fix to scrub_raid56_parity ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and ↵Anand Jain
uuid_mutex When the replace target fails, the target device will be taken out of fs device list, scratch + update_dev_time and freed. However we could do the scratch + update_dev_time and free part after the device has been taken out of device list, so that we don't have to hold the device_list_mutex and uuid_mutex locks. Reported issue: [ 5375.718845] ====================================================== [ 5375.718846] [ INFO: possible circular locking dependency detected ] [ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted [ 5375.718849] ------------------------------------------------------- [ 5375.718851] btrfs-health/4662 is trying to acquire lock: [ 5375.718861] (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.718862] [ 5375.718862] but task is already holding lock: [ 5375.718907] (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs] [ 5375.718907] [ 5375.718907] which lock already depends on the new lock. [ 5375.718907] [ 5375.718908] [ 5375.718908] the existing dependency chain (in reverse order) is: [ 5375.718911] [ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}: [ 5375.718917] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718921] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0 [ 5375.718940] [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs] [ 5375.718945] [<ffffffff81267079>] show_vfsmnt+0x49/0x150 [ 5375.718948] [<ffffffff81240b07>] m_show+0x17/0x20 [ 5375.718951] [<ffffffff81246868>] seq_read+0x2d8/0x3b0 [ 5375.718955] [<ffffffff8121df28>] __vfs_read+0x28/0xd0 [ 5375.718959] [<ffffffff8121e806>] vfs_read+0x86/0x130 [ 5375.718962] [<ffffffff8121f4c9>] SyS_read+0x49/0xa0 [ 5375.718966] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.718968] [ 5375.718968] -> #2 (namespace_sem){+++++.}: [ 5375.718971] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718974] [<ffffffff81635199>] down_write+0x49/0x80 [ 5375.718977] [<ffffffff81243593>] lock_mount+0x43/0x1c0 [ 5375.718979] [<ffffffff81243c13>] do_add_mount+0x23/0xd0 [ 5375.718982] [<ffffffff81244afb>] do_mount+0x27b/0xe30 [ 5375.718985] [<ffffffff812459dc>] SyS_mount+0x8c/0xd0 [ 5375.718988] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.718991] [ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}: [ 5375.718994] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.718996] [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0 [ 5375.719001] [<ffffffff8122d608>] path_openat+0x468/0x1360 [ 5375.719004] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719007] [<ffffffff8121da7b>] do_sys_open+0x12b/0x210 [ 5375.719010] [<ffffffff8121db7e>] SyS_open+0x1e/0x20 [ 5375.719013] [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a [ 5375.719015] [ 5375.719015] -> #0 (sb_writers){.+.+.+}: [ 5375.719018] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0 [ 5375.719021] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.719026] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0 [ 5375.719028] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.719031] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50 [ 5375.719035] [<ffffffff8122ded2>] path_openat+0xd32/0x1360 [ 5375.719037] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719040] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130 [ 5375.719043] [<ffffffff8121d923>] filp_open+0x33/0x60 [ 5375.719073] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs] [ 5375.719099] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs] [ 5375.719123] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs] [ 5375.719150] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs] [ 5375.719175] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs] [ 5375.719199] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs] [ 5375.719222] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs] [ 5375.719225] [<ffffffff810a70df>] kthread+0xef/0x110 [ 5375.719229] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70 [ 5375.719230] [ 5375.719230] other info that might help us debug this: [ 5375.719230] [ 5375.719233] Chain exists of: [ 5375.719233] sb_writers --> namespace_sem --> &fs_devs->device_list_mutex [ 5375.719233] [ 5375.719234] Possible unsafe locking scenario: [ 5375.719234] [ 5375.719234] CPU0 CPU1 [ 5375.719235] ---- ---- [ 5375.719236] lock(&fs_devs->device_list_mutex); [ 5375.719238] lock(namespace_sem); [ 5375.719239] lock(&fs_devs->device_list_mutex); [ 5375.719241] lock(sb_writers); [ 5375.719241] [ 5375.719241] *** DEADLOCK *** [ 5375.719241] [ 5375.719243] 4 locks held by btrfs-health/4662: [ 5375.719266] #0: (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs] [ 5375.719293] #1: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs] [ 5375.719319] #2: (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs] [ 5375.719343] #3: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs] [ 5375.719343] [ 5375.719343] stack backtrace: [ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40 [ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015 [ 5375.719352] 0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0 [ 5375.719354] ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930 [ 5375.719357] ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004 [ 5375.719357] Call Trace: [ 5375.719363] [<ffffffff813529e3>] dump_stack+0x85/0xc2 [ 5375.719366] [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260 [ 5375.719369] [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0 [ 5375.719373] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 5375.719376] [<ffffffff810da4be>] lock_acquire+0xce/0x1e0 [ 5375.719378] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0 [ 5375.719383] [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0 [ 5375.719385] [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0 [ 5375.719387] [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0 [ 5375.719389] [<ffffffff81242eb4>] mnt_want_write+0x24/0x50 [ 5375.719393] [<ffffffff8122ded2>] path_openat+0xd32/0x1360 [ 5375.719415] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs] [ 5375.719418] [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 5375.719420] [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0 [ 5375.719423] [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80 [ 5375.719426] [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0 [ 5375.719430] [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120 [ 5375.719433] [<ffffffff8121d8a4>] file_open_name+0xe4/0x130 [ 5375.719436] [<ffffffff8121d923>] filp_open+0x33/0x60 [ 5375.719462] [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs] [ 5375.719485] [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs] [ 5375.719506] [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs] [ 5375.719530] [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs] [ 5375.719554] [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs] [ 5375.719576] [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs] [ 5375.719598] [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs] [ 5375.719621] [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs] [ 5375.719641] [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs] [ 5375.719661] [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs] [ 5375.719663] [<ffffffff810a70df>] kthread+0xef/0x110 [ 5375.719666] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 5375.719669] [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70 [ 5375.719672] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 5375.719697] ------------[ cut here ]------------ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: allow balancing to dup with multi-deviceAustin S. Hemmelgarn
Currently, we don't allow the user to try and rebalance to a dup profile on a multi-device filesystem. In most cases, this is a perfectly sensible restriction as raid1 uses the same amount of space and provides better protection. However, when reshaping a multi-device filesystem down to a single device filesystem, this requires the user to convert metadata and system chunks to single profile before deleting devices, and then convert again to dup, which leaves a period of time where metadata integrity is reduced. This patch removes the single-device-only restriction from converting to dup profile to remove this potential data integrity reduction. Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04btrfs: cleanup assigning next active device with a checkAnand Jain
Creates helper fucntion as needed by the device delete and replace operations. Also now it checks if the next device being assigned is an active device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04btrfs: s_bdev is not null after missing replaceAnand Jain
Yauhen reported in the ML that s_bdev is null at mount, and s_bdev gets updated to some device when missing device is replaced, as because bdev is null for missing device, things gets matched up. Fix this by checking if s_bdev is set. I didn't want to completely remove updating s_bdev because the future multi device support at vfs layer may need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: rename btrfs_find_device_by_user_inputDavid Sterba
For clarity how we are going to find the device, let's call it a device specifier, devspec for short. Also rename the arguments that are a leftover from previous function purpose. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: use existing device constraints table btrfs_raid_arrayDavid Sterba
We should avoid duplicating the device constraints, let's use the btrfs_raid_array in btrfs_check_raid_min_devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: introduce raid-type to error-code table, for minimum device constraintDavid Sterba
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: pass number of devices to btrfs_check_raid_min_devicesDavid Sterba
Before this patch, btrfs_check_raid_min_devices would do an off-by-one check of the constraints and not the miminmum check, as its name suggests. This is not a problem if the only caller is device remove, but would be confusing for others. Add an argument with the exact number and let the caller(s) decide if this needs any adjustments, like when device replace is running. Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: rename __check_raid_min_devicesDavid Sterba
Underscores are for special functions, use the full prefix for better stacktrace recognition. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: optimize check for stale deviceAnand Jain
Optimize check for stale device to only be checked when there is device added or changed. If there is no update to the device, there is no need to call btrfs_free_stale_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: introduce device delete by devidAnand Jain
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as an user argument. The patch won't delete the old ioctl interface and so kernel remains backward compatible with user land progs. Test case/script: echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk mount /dev/sdd /btrfs dmsetup suspend bad_disk echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk dmsetup resume bad_disk echo "bad disk failed. now deleting/replacing" btrfs dev del 3 /btrfs echo $? btrfs fi show /btrfs umount /btrfs btrfs-show-super /dev/sdd | egrep num_device dmsetup remove bad_disk wipefs -a /dev/sdf Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Martin <m_btrfs@ml1.co.uk> [ adjust messages, s/disk/device/ ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()Anand Jain
With the previous patches now the btrfs_scratch_superblocks() is ready to be used in btrfs_rm_device() so use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: enhance btrfs_find_device_by_user_input() to check device pathAnand Jain
The operation of device replace and device delete follows same steps upto some depth with in btrfs kernel, however they don't share codes. This enhancement will help replace and delete to share codes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: make use of btrfs_find_device_by_user_input()Anand Jain
btrfs_rm_device() has a section of the code which can be replaced btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: create helper btrfs_find_device_by_user_input()Anand Jain
The patch renames btrfs_dev_replace_find_srcdev() to btrfs_find_device_by_user_input() and moves it to volumes.c, so that delete device can use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: clean up and optimize __check_raid_min_device()Anand Jain
__check_raid_min_device() which was pealed from btrfs_rm_device() maintianed its original code to show the block move. This patch cleans up __check_raid_min_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: create helper function __check_raid_min_devices()Anand Jain
move a section of btrfs_rm_device() code to check for min number of the devices into the function __check_raid_min_devices() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: create a helper function to read the disk superAnand Jain
A part of code from btrfs_scan_one_device() is moved to a new function btrfs_read_disk_super(), so that former function looks cleaner. (In this process it also moves the code which ensures null terminating label). So this creates easy opportunity to merge various duplicate codes on read disk super. Earlier attempt to merge duplicate codes highlighted that there were some issues for which there are duplicate codes (to read disk super), however it was not clear what was the issue. So until we figure that out, its better to keep them in a separate functions. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28Btrfs: do not create empty block group if we have allocated dataLiu Bo
Now we force to create empty block group to keep data profile alive, however, in the below example, we eventually get an empty block group while we're trying to get more space for other types (metadata/system), - Before, block group "A": size=2G, used=1.2G block group "B": size=2G, used=512M - After "btrfs balance start -dusage=50 mount_point", block group "A": size=2G, used=(1.2+0.5)G block group "C": size=2G, used=0 Since there is no data in block group C, it won't be deleted automatically and we have to get the unused 2G until the next mount. Balance itself just moves data and doesn't remove data, so it's safe to not create such a empty block group if we already have data allocated in other block groups. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28btrfs: rename btrfs_std_error to btrfs_handle_fs_errorAnand Jain
btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-14btrfs: Fix misspellings in comments.Adam Buchbinder
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-26Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6David Sterba
2016-02-26Merge branch 'dev/rename-keys' into for-chris-4.6David Sterba
2016-02-23Btrfs: fix lockdep deadlock warning due to dev_replaceLiu Bo
Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11btrfs: switch dev stats item to the permanent item keyDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>