summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-09-10Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlockFilipe Manana
After commmit e44163e17796 ("btrfs: explictly delete unused block groups in close_ctree and ro-remount"), added in the 4.3 merge window, we have calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex. This can cause a deadlock with a concurrent block group relocation (when a filesystem balance or shrink operation is in progress for example) because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the relocation path locks first delete_unused_bgs_mutex and then it locks cleaner_mutex, resulting in a classic ABBA deadlock: CPU 0 CPU 1 lock fs_info->cleaner_mutex __btrfs_balance() || btrfs_shrink_device() lock fs_info->delete_unused_bgs_mutex btrfs_relocate_chunk() btrfs_relocate_block_group() lock fs_info->cleaner_mutex btrfs_delete_unused_bgs() lock fs_info->delete_unused_bgs_mutex Fix this by not taking the cleaner_mutex before calling btrfs_delete_unused_bgs() because it's no longer needed after commit 67c5e7d464bc ("Btrfs: fix race between balance and unused block group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the spinlock fs_info->unused_bgs_lock and a block group's spinlock are enough to get correct serialization between tasks running relocation and unused block group deletion (as well as between multiple tasks concurrently calling btrfs_delete_unused_bgs()). This issue was discussed (in the mailing list) during the review of the patch titled "btrfs: explictly delete unused block groups in close_ctree and ro-remount" and it was agreed that acquiring the cleaner mutex had to be dropped after the patch titled "Btrfs: fix race between balance and unused block group deletion" got merged (both patches were submitted at about the same time, but one landed in kernel 4.2 and the other in the 4.3 merge window). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-08Btrfs: don't initialize a space info as full to prevent ENOSPCFilipe Manana
Commit 2e6e518335f8 ("Btrfs: fix block group ->space_info null pointer dereference") accidently marked a space info as full when initializing it with a value of 0 total bytes. This introduces an ENOSPC problem when writing file data if we mount a filesystem that has no data block groups allocated, because the data space info is initialized with 0 total bytes, marked as full, and it never gets its total bytes incremented by a (positive) value to unmark it as full (because there are no data block groups loaded when the fs is mounted). For metadata and system spaces this issue can never happen since we always have at least one metadata block group and one system block group (even for an empty filesystem). So fix this by just not initializing a space info as full, reverting the offending part of the commit mentioned above. The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 # Mount our filesystem without space caches enabled so that we do not # get any space used from the initial data block group that mkfs creates # (space caches used space from data block groups). _scratch_mount "-o nospace_cache" # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create # an fs using mixed block groups (used both for data and metadata). We # really need to have dedicated block groups for data to reproduce the # issue and mkfs.btrfs defaults to mixed block groups only for small # filesystems (up to 1Gb). _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024)) # Run balance with the purpose of deleting the unused data block group # that mkfs created. We could also wait for the background kthread to # automatically delete the unused block group, but we do not have a way # to make it run and wait for it to complete, so just do a balance # instead of some unreliable sleep _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT # Now unmount the filesystem, mount it again (either with or with space # caches enabled, it does not matter to trigger the problem) and attempt # to create a file with some data - this used to fail with ENOSPC # because there were no data block groups when the filesystem was # mounted and the data space info object was marked as full when # initialized (because it had 0 total bytes), which prevented the file # write path from attempting to allocate a data block group and fail # immediately with ENOSPC. _scratch_remount echo "hello world" > $SCRATCH_MNT/foobar echo "Silence is golden" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-08-31Btrfs: cleanup: remove unnecessary check before btrfs_free_path is calledTsutomu Itoh
We need not check path before btrfs_free_path() is called because path is checked in btrfs_free_path(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: async_thread: Fix workqueue 'max_active' value when initializingQu Wenruo
At initializing time, for threshold-able workqueue, it's max_active of kernel workqueue should be 1 and grow if it hits threshold. But due to the bad naming, there is both 'max_active' for kernel workqueue and btrfs workqueue. So wrong value is given at workqueue initialization. This patch fixes it, and to avoid further misunderstanding, change the member name of btrfs_workqueue to 'current_active' and 'limit_active'. Also corresponding comment is added for readability. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: Add raid56 support for updatingZhao Lei
num_tolerated_disk_barrier_failures in btrfs_balance Code for updating fs_info->num_tolerated_disk_barrier_failures in btrfs_balance() lacks raid56 support. Reason: Above code was wroten in 2012-08-01, together with btrfs_calc_num_tolerated_disk_barrier_failures()'s first version. Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated later to support raid56, but code in btrfs_balance() was not updated together. Fix: Merge above similar code to a common function: btrfs_get_num_tolerated_disk_barrier_failures() and make it support both case. It can fix this bug with a bonus of cleanup, and make these code never in above no-sync state from now on. Suggested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failuresZhao Lei
1: Use ARRAY_SIZE(types) to replace a static-value variant: int num_types = 4; 2: Use 'continue' on condition to reduce one level tab if (!XXX) { code; ... } -> if (XXX) continue; code; ... 3: Put setting 'num_tolerated_disk_barrier_failures = 2' to (num_tolerated_disk_barrier_failures > 2) condition to make make logic neat. if (num_tolerated_disk_barrier_failures > 0 && XXX) num_tolerated_disk_barrier_failures = 0; else if (num_tolerated_disk_barrier_failures > 1) { if (XXX) num_tolerated_disk_barrier_failures = 1; else if (XXX) num_tolerated_disk_barrier_failures = 2; -> if (num_tolerated_disk_barrier_failures > 0 && XXX) num_tolerated_disk_barrier_failures = 0; if (num_tolerated_disk_barrier_failures > 1 && XXX) num_tolerated_disk_barrier_failures = ; if (num_tolerated_disk_barrier_failures > 2 && XXX) num_tolerated_disk_barrier_failures = 2; 4: Remove comment of: num_mirrors - 1: if RAID1 or RAID10 is configured and more than 2 mirrors are used. which is not fit with code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: Remove noused chunk_tree and chunk_objectid from ↵Zhao Lei
scrub_enumerate_chunks and scrub_chunk These variables are not used from introduced version, remove them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: Update out-of-date "skip parity stripe" commentZhao Lei
Because btrfs support scrub raid56 parity stripe now. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-21btrfs: fix compile when block cgroups are not enabledChris Mason
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on. This adds an ifdef around them. It's not perfect, but our use of bi_ioc is being removed in the 4.3 merge window. The bi_css usage really should go into bio_clone, but I want to make sure that doesn't introduce problems for other bio_clone use cases. Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19Btrfs: fix file read corruption after extent cloning and fsyncFilipe Manana
If we partially clone one extent of a file into a lower offset of the file, fsync the file, power fail and then mount the fs to trigger log replay, we can get multiple checksum items in the csum tree that overlap each other and result in checksum lookup failures later. Those failures can make file data read requests assume a checksum value of 0, but they will not return an error (-EIO for example) to userspace exactly because the expected checksum value 0 is a special value that makes the read bio endio callback return success and set all the bytes of the corresponding page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()). From a userspace perspective this is equivalent to file corruption because we are not returning what was written to the file. Details about how this can happen, and why, are included inline in the following reproducer test case for fstests and the comment added to tree-log.c. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_flakey _require_cloner _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with a single 100K extent starting at file # offset 800K. We fsync the file here to make the fsync log tree gets # a single csum item that covers the whole 100K extent, which causes # the second fsync, done after the cloning operation below, to not # leave in the log tree two csum items covering two sub-ranges # ([0, 20K[ and [20K, 100K[)) of our extent. $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone part of our extent into file offset 400K. This adds a file # extent item to our inode's metadata that points to the 100K extent # we created before, using a data offset of 20K and a data length of # 20K, so that it refers to the sub-range [20K, 40K[ of our original # extent. $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \ -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Now fsync our file to make sure the extent cloning is durably # persisted. This fsync will not add a second csum item to the log # tree containing the checksums for the blocks in the sub-range # [20K, 40K[ of our extent, because there was already a csum item in # the log tree covering the whole extent, added by the first fsync # we did before. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo echo "File digest before power failure:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Silently drop all writes and ummount to simulate a crash/power # failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file # contents. # The fsync log replay first processes the file extent item # corresponding to the file offset 400K (the one which refers to the # [20K, 40K[ sub-range of our 100K extent) and then processes the file # extent item for file offset 800K. It used to happen that when # processing the later, it erroneously left in the csum tree 2 csum # items that overlapped each other, 1 for the sub-range [20K, 40K[ and # 1 for the whole range of our extent. This introduced a problem where # subsequent lookups for the checksums of blocks within the range # [40K, 100K[ of our extent would not find anything because lookups in # the csum tree ended up looking only at the smaller csum item, the # one covering the subrange [20K, 40K[. This made read requests assume # an expected checksum with a value of 0 for those blocks, which caused # checksum verification failure when the read operations finished. # However those checksum failure did not result in read requests # returning an error to user space (like -EIO for e.g.) because the # expected checksum value had the special value 0, and in that case # btrfs set all bytes of the corresponding pages with the value 0x01 # and produce the following warning in dmesg/syslog: # # "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\ # 1322675045 expected csum 0" # _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File digest after log replay:" # Must match the same digest he had after cloning the extent and # before the power failure happened. md5sum $SCRATCH_MNT/foo | _filter_scratch _unmount_flakey status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19Btrfs: check if previous transaction aborted to avoid fs corruptionFilipe Manana
While we are committing a transaction, it's possible the previous one is still finishing its commit and therefore we wait for it to finish first. However we were not checking if that previous transaction ended up getting aborted after we waited for it to commit, so we ended up committing the current transaction which can lead to fs corruption because the new superblock can point to trees that have had one or more nodes/leafs that were never durably persisted. The following sequence diagram exemplifies how this is possible: CPU 0 CPU 1 transaction N starts (...) btrfs_commit_transaction(N) cur_trans->state = TRANS_STATE_COMMIT_START; (...) cur_trans->state = TRANS_STATE_COMMIT_DOING; (...) cur_trans->state = TRANS_STATE_UNBLOCKED; root->fs_info->running_transaction = NULL; btrfs_start_transaction() --> starts transaction N + 1 btrfs_write_and_wait_transaction(trans, root); --> starts writing all new or COWed ebs created at transaction N creates some new ebs, COWs some existing ebs but doesn't COW or deletes eb X btrfs_commit_transaction(N + 1) (...) cur_trans->state = TRANS_STATE_COMMIT_START; (...) wait_for_commit(root, prev_trans); --> prev_trans == transaction N btrfs_write_and_wait_transaction() continues writing ebs --> fails writing eb X, we abort transaction N and set bit BTRFS_FS_STATE_ERROR on fs_info->fs_state, so no new transactions can start after setting that bit cleanup_transaction() btrfs_cleanup_one_transaction() wakes up task at CPU 1 continues, doesn't abort because cur_trans->aborted (transaction N + 1) is zero, and no checks for bit BTRFS_FS_STATE_ERROR in fs_info->fs_state are made btrfs_write_and_wait_transaction(trans, root); --> succeeds, no errors during writeback write_ctree_super(trans, root, 0); --> succeeds --> we have now a superblock that points us to some root that uses eb X, which was never written to disk In this scenario future attempts to read eb X from disk results in an error message like "parent transid verify failed on X wanted Y found Z". So fix this by aborting the current transaction if after waiting for the previous transaction we verify that it was aborted. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19btrfs: use __GFP_NOFAIL in alloc_btrfs_bioMichal Hocko
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can still fail if the _current_ process is the OOM killer victim. Moreover there is an attempt to move away from the default no-fail behavior and allow these allocation to fail more eagerly. This would lead to: [ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045 which is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19btrfs: Prevent from early transaction abortMichal Hocko
Btrfs relies on GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem currently but there is an attempt to move away from the default no-fail behavior and allow these allocation to fail more eagerly. And this would lead to a pre-mature transaction abort as follows: [ 55.328093] Call Trace: [ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363 [ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4 [ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c [ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs] [ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs] [ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs] [ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs] [ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs] [ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e [ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs] [ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs] [ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory [ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction. [ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure). [ 55.384280] ------------[ cut here ]------------ [ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]() [...] [ 55.384337] Call Trace: [ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37 [ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb [ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c [ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs] [ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs] [ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [...] [ 55.384608] ---[ end trace c29799da1d4dd621 ]--- [ 55.437323] BTRFS info (device hdb1): forced readonly [ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry Fix this by being explicit about the no-fail behavior of this allocation path and use __GFP_NOFAIL. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19btrfs: Remove unused arguments in tree-log.cZhaolei
Following arguments are not used in tree-log.c: insert_one_name(): path, type wait_log_commit(): trans wait_for_writer(): trans This patch remove them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19btrfs: Remove useless condition in start_log_trans()Zhaolei
Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning for start_log_trans(): fs/btrfs/tree-log.c:178 start_log_trans() warn: we tested 'root->log_root' before and it was 'false' fs/btrfs/tree-log.c 147 if (root->log_root) { We test "root->log_root" here. ... Reason: Condition of: fs/btrfs/tree-log.c:178: if (!root->log_root) { is not necessary after commit: 7237f1833 It caused a smatch warning, and no functionally error. Fix: Deleting above condition will make smatch shut up, but a better way is to do cleanup for start_log_trans() to remove duplicated code and make code more readable. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Merge branch 'jeffm-discard-4.3' into for-linus-4.3Chris Mason
2015-08-09Btrfs: add support for blkio controllersChris Mason
This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: remove unused mutex from struct 'btrfs_fs_info'Byongho Lee
The code using 'ordered_extent_flush_mutex' mutex has removed by below commit. - 8d875f95da43c6a8f18f77869f2ef26e9594fecc btrfs: disable strict file flushes for renames and truncates But the mutex still lives in struct 'btrfs_fs_info'. So, this patch removes the mutex from struct 'btrfs_fs_info' and its initialization code. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix parity scrub of RAID 5/6 with missing deviceOmar Sandoval
When testing the previous patch, Zhao Lei reported a similar bug when attempting to scrub a degraded RAID 5/6 filesystem with a missing device, leading to NULL pointer dereferences from the RAID 5/6 parity scrubbing code. The first cause was the same as in the previous patch: attempting to call bio_add_page() on a missing block device. To fix this, scrub_extent_for_parity() can just mark the sectors on the missing device as errors instead of attempting to read from it. Additionally, the code uses scrub_remap_extent() to map the extent of the corresponding data stripe, but the extent wasn't already mapped. If scrub_remap_extent() finds a missing block device, it doesn't initialize extent_dev, so we're left with a NULL struct btrfs_device. The solution is to use btrfs_map_block() directly. Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix device replace of a missing RAID 5/6 deviceOmar Sandoval
The original implementation of device replace on RAID 5/6 seems to have missed support for replacing a missing device. When this is attempted, we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which crashes when we try to dereference it. This happens because btrfs_map_block() has no choice but to return us the missing device because RAID 5/6 don't have any alternate mirrors to read from, and a missing device has a NULL bdev. The idea implemented here is to handle the missing device case separately, which better only happen when we're replacing a missing RAID 5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to reconstruct the data from parity, check it with scrub_recheck_block_checksum(), and write it out with scrub_write_block_to_dev_replace(). Reported-by: Philip <bugzilla@philip-seeger.de> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141 Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operationOmar Sandoval
The current RAID 5/6 recovery code isn't quite prepared to handle missing devices. In particular, it expects a bio that we previously attempted to use in the read path, meaning that it has valid pages allocated. However, missing devices have a NULL blkdev, and we can't call bio_add_page() on a bio with a NULL blkdev. We could do manual manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add a separate path that allows us to manually add pages to the rbio. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: count devices correctly in readahead during RAID 5/6 replaceOmar Sandoval
Commit 5fbc7c59fd22 ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") fixed a problem where we would skip a missing device when we shouldn't have because there are no other mirrors to read from in RAID 5/6. After commit 2c8cdd6ee4e7 ("Btrfs, replace: write dirty pages into the replace target device"), the fix doesn't work when we're doing a missing device replace on RAID 5/6 because the replace device is counted as a mirror so we're tricked into thinking we can safely skip the missing device. The fix is to count only the real stripes and decide based on that. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: remove misleading handling of missing device scrubOmar Sandoval
scrub_submit() claims that it can handle a bio with a NULL block device, but this is misleading, as calling bio_add_page() on a bio with a NULL ->bi_bdev would've already crashed. Delete this, as we're about to properly handle a missing block device. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: fix clone / extent-same deadlocksMark Fasheh
Clone and extent same lock their source and target inodes in opposite order. In addition to this, the range locking in clone doesn't take ordering into account. Fix this by having clone use the same locking helpers as btrfs-extent-same. In addition, I do a small cleanup of the locking helpers, removing a case (both inodes being the same) which was poorly accounted for and never actually used by the callers. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix defrag to merge tail file extentLiu Bo
The file layout is [extent 1]...[extent n][4k extent][HOLE][extent x] extent 1~n and 4k extent can be merged during defrag, and the whole defrag bytes is larger than our defrag thresh(256k), 4k extent as a tail is left unmerged since we check if its next extent can be merged (the next one is a hole, so the check will fail), the layout thus can be [new extent][4k extent][HOLE][extent x] (1~n) To fix it, beside looking at the next one, this also looks at the previous one by checking @defrag_end, which is set to 0 when we decide to stop merging contiguous extents, otherwise, we can merge the previous one with our extent. Also, this makes btrfs behave consistent with how xfs and ext4 do. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix warning in backref walkingLiu Bo
When we do backref walking, we search firstly in queued delayed refs and then the on-disk backrefs, but we parse differently for shared references, for delayed refs we also add 'ref->root' while for on-disk backrefs we don't, this can prevent us from merging refs indexed by the same bytenr and cause find_parent_nodes() to throw a warning at 'WARN_ON(ref->count < 0)', for example, when we have a shared data extent with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF, that happens. For shared references, no matter if it's delayed or on-disk, ref->root is not at all used, instead it's ref->parent that really matters, so this has delayed refs handled as the same way as on-disk refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()Zhaolei
When a task trying to double lock a extent buffer, there are no lockdep warning about it because this lock may be in "blocking_lock" state, and make us hard to debug. This patch add a WARN_ON() for above condition, it can not report all deadlock cases(as lock between tasks), but at least helps us some. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Remove root argument in extent_data_ref_count()Zhaolei
Because it is never used. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Fix wrong comment of btrfs_alloc_tree_block()Zhaolei
These wrong comment was copyed from another function(expired) from init, this patch fixed them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: abort transaction on btrfs_reloc_cow_block()Zhaolei
When btrfs_reloc_cow_block() failed in __btrfs_cow_block(), current code just return a err-value to caller, but leave new_created extent buffer exist and locked. Then subsequent code (in relocate) try to lock above eb again, and caused deadlock without any dmesg. (eb lock use wait_event(), so no lockdep message) It is hard to do recover work in __btrfs_cow_block() at this error point, but we can abort transaction to avoid deadlock and operate on unstable state.a It also helps developer to find wrong place quickly. (better than a frozen fs without any dmesg before patch) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Remove unnecessary variants in relocation.cZhaolei
These arguments are not used in functions, remove them for cleanup and make kernel stack happy. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()Zhaolei
Remove chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary, it can also cleanup some code in caller for prepare its value. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()Zhaolei
objectid's init-value is not used in any case, remove it. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()Zhaolei
We need error checking code for get_ref_objectid_v0() in relocate_block_group(), to avoid unpredictable result, especially for accessing uninitialized value(when function failed) after this line. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Fix data checksum error cause by replace with io-load.Zhaolei
xfstests btrfs/070 sometimes failed. In my test machine, its fail rate is about 30%. In another vm(vmware), its fail rate is about 50%. Reason: btrfs/070 do replace and defrag with fsstress simultaneously, after above operation, checksum error is found by scrub. Actually, it have no relationship with defrag operation, only replace with fsstress can trigger this bug. New data writen to target device have possibility rewrited by old data from source device by replace code in debug, to avoid above problem, we can set target block group to readonly in replace period, so new data requested by other operation will not write to same place with replace code. Before patch(4.1-rc3): 30% failed in 100 xfstests. After patch: 0% failed in 300 xfstests. It also happened in btrfs/071 as it's another scrub with IO load tests. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()Zhaolei
Use new intruduced scrub_pause_on/off() can make this code block clean and more readable. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()Zhaolei
It can reduce current duplicated code which is similar to scrub_blocked_if_needed() but can not call it because little different. It also used by my next patch which is in same case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Use ref_cnt for set_block_group_ro()Zhaolei
More than one code call set_block_group_ro() and restore rw in fail. Old code use bool bit to save blockgroup's ro state, it can not support parallel case(it is confirmd exist in my debug log). This patch use ref count to store ro state, and rename set_block_group_ro/set_block_group_rw to inc_block_group_ro/dec_block_group_ro. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Bypass unrelated items before accessing its contents in scrubZhao Lei
When we access extent_root in scrub_stripe() and scrub_raid56_parity(), we need bypass unrelated tree item firstly before using its contents to do other condition. It is not a bug fix, only making code sequence in logic. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Load only necessary csums into list in scrubZhao Lei
We need not load csum of whole strip in scrub because strip is trimed before use, it is to say, what we really need to calculate csum is data between [extent_logical, extent_len). This patch changed to use above segment for btrfs_lookup_csums_range() in scrub_stripe() Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Fix calculate typo caused by ambiguous meaning of logic_endZhao Lei
For example, in scrub_raid56_parity(), following lines are used to judge is all data processed: place1: if (key.objectid > logic_end) ... place2: if (logic_start >= logic_end) ... ... (place2 is typo, is should be ">", it is copied from other place, where logic_end's meaning is different, long story...) We can fix above typo directly, but the root reason is ambiguous meaning of logic_end in scrub raid56 parity. In other place, XXX_end is pointed to data which is not included, and we need to process segment of [XXX_start, XXX_end). But for scrub raid56 parity, logic_end is pointed to lattest data need to process, and introduced many "+ 1" and "- 1" in code as below: length = sparity->logic_end - sparity->logic_start + 1 logic_end - logic_start + 1 stripe_logical + increment - 1 This patch changed logic_end's meaning to make it in normal understanding in raid56 parity functions and data struct alone with above bugfix. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Free checksum list on scrub_extent() failZhao Lei
When scrub_extent() failed, we need to free previois created checksum list. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Check cancel and pause in interval of scrub operationZhao Lei
Old code checking cancel and pause request inside scrub stripe operation, like: loop() { if (parity) { scrub_parity_stripe(); continue; } check_cancel_and_pause() scrub_normal_stripe(); } Reason is when introduce raid56 stripe scrub, new code is inserted simplely to front of loop. Better to: loop() { check_cancel_and_pause() if (parity) scrub_parity_stripe(); else scrub_normal_stripe(); } This patch adjusted code place to realize above sequence. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Show detail information when mount failed on missing devicesZhao Lei
When mount failed because missing device, we can see following dmesg: [ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed [ 1060.273158] BTRFS: open_ctree failed This patch add missing_device_number and tolerated_missing_device_number to above output, to let user know what really happened, and helps bug-report and debug. dmesg after patch: [ 127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed [ 127.056099] BTRFS: open_ctree failed Changelog v1->v2: 1: Changed to more clear description, suggested-by: Anand Jain <anand.jain@oracle.com> Suggested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: Fix scrub panic when leaf crosses stripesZhao Lei
Scrub panic in following operation: mkfs.ext4 /dev/vdh btrfs-convert /dev/vdh mount /dev/vdh /mnt/tmp1 btrfs scrub start -B /dev/vdh (panic) Reason: 1: In some case, leaf created by btrfs-convert was splited into 2 strips. 2: Scrub bypassed part of above wrong leaf data, but remain data caused panic in scrub_checksum_tree_block(). For reason 1: we can get following information after some simple operation. a. mkfs.ext4 /dev/vdh btrfs-convert /dev/vdh b. btrfs-debug-tree /dev/vdh we can see following item in extent tree: item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33 Its logical address is [27054080, 27070464) and acrossed 2 strips: [27000832, 27066368) [27066368, 27131904) Will be fixed in btrfs-progs(btrfs-convert, btrfsck, ...) For reason 2: Scrub is trying to do a "bypass" in this case, but the result is "panic", because current code lacks of some condition in bypass, and let some wrong leaf data escaped. This patch fixed above scrub code. Before patch: # btrfs scrub start -B /dev/vdh (panic) After patch: # btrfs scrub start -B /dev/vdh scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e ... # dmesg ... [ 59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832 [ 59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368 # Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix stale dir entries after removing a link and fsyncFilipe Manana
We have one more case where after a log tree is replayed we get inconsistent metadata leading to stale directory entries, due to some directories having entries pointing to some inode while the inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item. To trigger the problem we need to have a file with multiple hard links belonging to different parent directories. Then if one of those hard links is removed and we fsync the file using one of its other links that belongs to a different parent directory, we end up not logging the fact that the removed hard link doesn't exists anymore in the parent directory. Simple reproducer: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test directory and file. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/foo ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2 ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3 # Make sure everything done so far is durably persisted. sync # Now we remove one of our file's hardlinks in the directory testdir. unlink $SCRATCH_MNT/testdir/foo3 # We now fsync our file using the "foo" link, which has a parent that # is not the directory "testdir". $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Silently drop all writes and unmount to simulate a crash/power # failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger journal/log replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # After the journal/log is replayed we expect to not see the "foo3" # link anymore and we should be able to remove all names in the # directory "testdir" and then remove it (no stale directory entries # left after the journal/log replay). echo "Entries in testdir:" ls -1 $SCRATCH_MNT/testdir rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey status=0 exit The test fails with: $ ./check generic/107 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+ MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad) --- tests/generic/107.out 2015-08-01 01:39:45.807462161 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad @@ -1,3 +1,5 @@ QA output created by 107 Entries in testdir: foo2 +foo3 +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty ... _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \ (see /home/fdmanana/git/hub/xfstests/results//generic/107.full) _check_dmesg: something found in dmesg (see .../results/generic/107.dmesg) Ran: generic/107 Failures: generic/107 Failed 1 of 1 tests $ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full (...) checking fs roots root 5 inode 257 errors 200, dir isize wrong unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref (...) And produces the following warning in dmesg: [127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257 [127298.762081] ------------[ cut here ]------------ [127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]() [127298.767327] BTRFS: Transaction aborted (error -2) (...) [127298.788611] Call Trace: [127298.789137] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [127298.790090] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [127298.791157] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [127298.792323] [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs] [127298.793633] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [127298.794699] [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs] [127298.797640] [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs] [127298.798876] [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs] [127298.800154] [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed [127298.801303] [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb [127298.802450] [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14 [127298.803797] [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b [127298.805017] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [127298.806310] ---[ end trace bbfddacb7aaada7b ]--- [127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry). So fix this by logging all parent inodes, current and old ones, to make sure we do not get stale entries after log replay. This is not a simple solution such as triggering a full transaction commit because it would imply full transaction commit when an inode is fsynced in the same transaction that modified it and reloaded it after eviction (because its last_unlink_trans is set to the same value as its last_trans as of the commit with the title "Btrfs: fix stale dir entries after unlink, inode eviction and fsync"), and it would also make fstest generic/066 fail since one of the fsyncs triggers a full commit and the next fsync will not find the inode in the log anymore (therefore not removing the xattr). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: fix search key advancing conditionNaohiro Aota
The search key advancing condition used in copy_to_sk() is loose. It can advance the key even if it reaches sk->max_*: e.g. when the max key = (512, 1024, -1) and the current key = (512, 1025, 10), it increments the offset by 1, continues hopeless search from (512, 1025, 11). This issue make ioctl() to take unexpectedly long time scanning all the leaf a blocks one by one. This commit fix the problem using standard way of key comparison: btrfs_comp_cpu_keys() Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: teach backref walking about backrefs with underflowed offset valuesFilipe Manana
When cloning/deduplicating file extents (through the clone and extent_same ioctls) we can get data back references with offset values that are a result of an unsigned integer arithmetic underflow, that is, values that are much larger then they could be otherwise. This is not a problem when decrementing or dropping the back references (happens when we overwrite the extents or punch a hole for example, through __btrfs_drop_extents()), since we compute the same too large offset value, but it is a problem for the backref walking code, used by an incremental send and the ioctls that are used by the btrfs tool "inspect-internal" commands, as it makes it miss the corresponding file extent items because the search key is set for an extent item that starts at an offset matching the exceptionally large offset value of the data back reference. For an incremental send this causes the send ioctl to fail with -EIO. So teach the backref walking code to deal with these cases by setting the search key's offset to 0 if the backref's offset value is larger than LLONG_MAX (the largest possible file offset). This makes sure the backref walking code finds the corresponding file extent items at the expense of scanning more items and leafs in the btree. Fixing the clone/dedup ioctls to not produce such underflowed results would require major changes breaking backward compatibility, updating user space tools, etc. Simple reproducer case for fstests: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $send_files_dir rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _need_to_be_root send_files_dir=$TEST_DIR/btrfs-test-$seq rm -f $seqres.full rm -fr $send_files_dir mkdir $send_files_dir _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a single extent of 64K starting at file # offset 128K. $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \ | _filter_xfs_io _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/mysnap1 # Now clone parts of the original extent into lower offsets of the file. # # The first clone operation adds a file extent item to file offset 0 # that points to our initial extent with a data offset of 16K. The # corresponding data back reference in the extent tree has an offset of # 18446744073709535232, which is the result of file_offset - data_offset # = 0 - 16K. # # The second clone operation adds a file extent item to file offset 16K # that points to our initial extent with a data offset of 48K. The # corresponding data back reference in the extent tree has an offset of # 18446744073709518848, which is the result of file_offset - data_offset # = 16K - 48K. # # Those large back reference offsets (result of unsigned arithmetic # underflow) confused the back reference walking code (used by an # incremental send and the multiple inspect-internal ioctls) and made it # miss the back references, which for the case of an incremental send it # made it fail with -EIO and print a message like the following to # dmesg: # # "BTRFS error (device sdc): did not find backref in send_root. \ # inode=257, offset=0, disk_byte=12845056 found extent=12845056" # $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \ -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/mysnap2 _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \ -f $send_files_dir/2.snap echo "File digest in the original filesystem:" md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch # Now recreate the filesystem by receiving both send streams and verify # we get the same file contents that the original filesystem had. _scratch_unmount _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap echo "File digest in the new filesystem:" md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch status=0 exit The test's expected golden output is: wrote 65536/65536 bytes at offset 131072 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File digest in the original filesystem: 6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo File digest in the new filesystem: 6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo But it failed with: (...) @@ -1,7 +1,5 @@ QA output created by 097 wrote 65536/65536 bytes at offset 131072 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) -File digest in the original filesystem: -6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo -File digest in the new filesystem: -6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo ... $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full (...) ERROR: send ioctl failed with -5: Input/output error Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix stale dir entries after unlink, inode eviction and fsyncFilipe Manana
If we remove a hard link from an inode, the inode gets evicted, then we fsync the inode and then power fail/crash, when the log tree is replayed, the parent directory inode still has entries pointing to the name that no longer exists, while our inode no longer has the BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected), leaving the filesystem in an inconsistent state. The stale directory entries can not be deleted (an attempt to delete them causes -ESTALE errors), which makes it impossible to delete the parent directory. This happens because we track the id of the transaction where the last unlink operation for the inode happened (last_unlink_trans) in an in-memory only field of the inode, that is, a value that is never persisted in the inode item stored on the fs/subvol btree. So if an inode is evicted and loaded again, the value for last_unlink_trans is set to 0, which prevents the fsync from logging the parent directory at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans to the id of the transaction that last modified the inode when we load the inode. This is a pessimistic approach but it always ensures correctness with the trade off of ocassional full transaction commits when an fsync is done against the inode in the same transaction where it was evicted and reloaded when our inode is a directory and often logging its parent unnecessarily when our inode is not a directory. The following test case for fstests triggers the problem: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with 2 hard links. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/testdir/foo ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar # Make sure everything done so far is durably persisted. sync # Now remove one of the links, trigger inode eviction and then fsync # our inode. unlink $SCRATCH_MNT/testdir/bar echo 2 > /proc/sys/vm/drop_caches $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo # Silently drop all writes on our scratch device to simulate a power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount the fs to trigger log/journal replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now verify our directory entries. echo "Entries in testdir:" ls -1 $SCRATCH_MNT/testdir # If we remove our inode, its parent should become empty and therefore we should # be able to remove the parent. rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey # The fstests framework will call fsck against our filesystem which will verify # that all metadata is in a consistent state. status=0 exit The test failed on btrfs with: generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad) --- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100 @@ -1,3 +1,6 @@ QA output created by 098 Entries in testdir: +bar foo +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty ... (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full) $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full (...) checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref Checking filesystem on /dev/sdc (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix stale directory entries after fsync log replayFilipe Manana
We have another case where after an fsync log replay we get an inode with a wrong link count (smaller than it should be) and a number of directory entries greater than its link count. This happens when we add a new link hard link to our inode A and then we fsync some other inode B that has the side effect of logging the parent directory inode too. In this case at log replay time we add the new hard link to our inode (the item with key BTRFS_INODE_REF_KEY) when processing the parent directory but we never adjust the link count of our inode A. As a result we get stale dir entries for our inode A that can never be deleted and therefore it makes it impossible to remove the parent directory (as its i_size can never decrease back to 0). A simple reproducer for fstests that triggers this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test directory and files. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/testdir/foo touch $SCRATCH_MNT/testdir/bar # Make sure everything done so far is durably persisted. sync # Create one hard link for file foo and another one for file bar. After # that fsync only the file bar. ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar # Silently drop all writes on scratch device to simulate power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount the fs to trigger log/journal replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now verify both our files have a link count of 2. echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)" echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)" # We should be able to remove all the links of our files in testdir, and # after that the parent directory should become empty and therefore # possible to remove it. rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey # The fstests framework will call fsck against our filesystem which will verify # that all metadata is in a consistent state. status=0 exit The test fails with: -Link count for file foo: 2 +Link count for file foo: 1 Link count for file bar: 2 +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty (...) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent And fsck's output: (...) checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref Checking filesystem on /dev/sdc (...) So fix this by marking inodes for link count fixup at log replay time whenever a directory entry is replayed if the entry was created in the transaction where the fsync was made and if it points to a non-directory inode. This isn't a new problem/regression, the issue exists for a long time, possibly since the log tree feature was added (2008). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>