summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2014-04-03fs/direct-io.c: remove some left over checksDan Carpenter
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have a small collection of fixes in my for-linus branch. The big thing that stands out is a revert of a new ioctl. Users haven't shipped yet in btrfs-progs, and Dave Sterba found a better way to export the information" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use right clone root offset for compressed extents btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol Btrfs: fix max_inline mount option Btrfs: fix a lockdep warning when cleaning up aborted transaction Revert "btrfs: add ioctl to export size of global metadata reservation"
2014-02-15Btrfs: use right clone root offset for compressed extentsFilipe David Borba Manana
For non compressed extents, iterate_extent_inodes() gives us offsets that take into account the data offset from the file extent items, while for compressed extents it doesn't. Therefore we have to adjust them before placing them in a send clone instruction. Not doing this adjustment leads to the receiving end requesting for a wrong a file range to the clone ioctl, which results in different file content from the one in the original send root. Issue reproducible with the following excerpt from the test I made for xfstests: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo # will be used for incremental send to be able to issue clone operations $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1 $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \ -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \ -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2 $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \ -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap _scratch_unmount _scratch_mkfs _scratch_mount $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-15btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105Anand Jain
bdev is null when disk has disappeared and mounted with the degrade option stack trace --------- btrfs_sysfs_add_one+0x105/0x1c0 [btrfs] open_ctree+0x15f3/0x1fe0 [btrfs] btrfs_mount+0x5db/0x790 [btrfs] ? alloc_pages_current+0xa4/0x160 mount_fs+0x34/0x1b0 vfs_kern_mount+0x62/0xf0 do_mount+0x22e/0xa80 ? __get_free_pages+0x9/0x40 ? copy_mount_options+0x31/0x170 SyS_mount+0x7e/0xc0 system_call_fastpath+0x16/0x1b --------- reproducer: ------- mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd (detach a disk) devmgt detach /dev/sdc [1] mount -o degrade /dev/sdd /btrfs ------- [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Btrfs: unset DCACHE_DISCONNECTED when mounting default subvolJosef Bacik
A user was running into errors from an NFS export of a subvolume that had a default subvol set. When we mount a default subvol we will use d_obtain_alias() to find an existing dentry for the subvolume in the case that the root subvol has already been mounted, or a dummy one is allocated in the case that the root subvol has not already been mounted. This allows us to connect the dentry later on if we wander into the path. However if we don't ever wander into the path we will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't appear to cause any problems but it is annoying nonetheless, so simply unset DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to use d_materialise_unique() instead which will make everything play nicely together and reconnect stuff if we wander into the defaul subvol path from a different way. With this patch I'm no longer getting the NFS errors when exporting a volume that has been mounted with a default subvol set. Thanks, cc: bfields@fieldses.org cc: ebiederm@xmission.com Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Btrfs: fix max_inline mount optionMitch Harder
Currently, the only mount option for max_inline that has any effect is max_inline=0. Any other value that is supplied to max_inline will be adjusted to a minimum of 4k. Since max_inline has an effective maximum of ~3900 bytes due to page size limitations, the current behaviour only has meaning for max_inline=0. This patch will allow the the max_inline mount option to accept non-zero values as indicated in the documentation. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Btrfs: fix a lockdep warning when cleaning up aborted transactionLiu Bo
Given now we have 2 spinlock for management of delayed refs, CONFIG_DEBUG_SPINLOCK=y helped me find this, [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258 [ 4723.414882] lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2 [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G W O 3.12.0+ #4 [ 4723.421321] Call Trace: [ 4723.421872] [<ffffffff81680fe7>] dump_stack+0x54/0x74 [ 4723.422753] [<ffffffff81681093>] spin_dump+0x8c/0x91 [ 4723.424979] [<ffffffff816810b9>] spin_bug+0x21/0x26 [ 4723.425846] [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90 [ 4723.434424] [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40 [ 4723.438747] [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs] [ 4723.443321] [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs] [ 4723.444692] [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 4723.450336] [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10 [ 4723.451332] [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs] [ 4723.452543] [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [ 4723.457833] [<ffffffff81079efa>] kthread+0xea/0xf0 [ 4723.458990] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.460133] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [ 4723.460865] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.496521] ------------[ cut here ]------------ ---------------------------------------------------------------------- The reason is that we get to call cond_resched_lock(&head_ref->lock) while still holding @delayed_refs->lock. So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire dance before and after actually processing delayed refs. Here we don't drop the lock, others are not able to add new delayed refs to head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-14Revert "btrfs: add ioctl to export size of global metadata reservation"Chris Mason
This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d. David Sterba found a different way to provide these features without adding a new ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out for now until we finalize things. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> CC: Jeff Mahoney <jeffm@suse.com>
2014-02-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix data corruption when reading/updating compressed extents Btrfs: don't loop forever if we can't run because of the tree mod log btrfs: reserve no transaction units in btrfs_ioctl_set_features btrfs: commit transaction after setting label and features Btrfs: fix assert screwup for the pending move stuff
2014-02-09Btrfs: fix data corruption when reading/updating compressed extentsFilipe David Borba Manana
When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-09Btrfs: don't loop forever if we can't run because of the tree mod logJosef Bacik
A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-09btrfs: reserve no transaction units in btrfs_ioctl_set_featuresDavid Sterba
Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-09btrfs: commit transaction after setting label and featuresJeff Mahoney
The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-09Btrfs: fix assert screwup for the pending move stuffJosef Bacik
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set, which meant that a key part of Filipe's original patch was not being built in. This appears to be a mess up with merging Filipe's patch as it does not exist in his original patch. Fix this by changing how we make sure del_waiting_dir_move asserts that it did not error and take the function out of the ifdef check. This makes btrfs/030 pass with the assert on or off. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is fixing compile and boot problems with our crc32c rework, and Josef has disabled snapshot aware defrag for now. As the number of snapshots increases, we're hitting OOM. For the short term we're disabling things until a bigger fix is ready" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use late_initcall instead of module_init Btrfs: use btrfs_crc32c everywhere instead of libcrc32c Btrfs: disable snapshot aware defrag for now
2014-02-03Btrfs: use late_initcall instead of module_initFilipe David Borba Manana
It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might not always be already initialized, which results in the call to crypto_alloc_shash() returning -ENOENT, as experienced by Ahmet who reported this. Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which is at initialization level 6, module_init), by using late_initcall (which is at initialization level 7) instead of module_init for btrfs. Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03Btrfs: use btrfs_crc32c everywhere instead of libcrc32cFilipe David Borba Manana
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in", LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not possible to build a kernel with btrfs enabled (either as module or built-in) if libcrc32c is not enabled as well. So just replace all uses of libcrc32c with the equivalent function in btrfs hash.h - btrfs_crc32c. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03Btrfs: disable snapshot aware defrag for nowJosef Bacik
It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-31Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a pretty big pull, and most of these changes have been floating in btrfs-next for a long time. Filipe's properties work is a cool building block for inheriting attributes like compression down on a per inode basis. Jeff Mahoney kicked in code to export filesystem info into sysfs. Otherwise, lots of performance improvements, cleanups and bug fixes. Looks like there are still a few other small pending incrementals, but I wanted to get the bulk of this in first" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits) Btrfs: fix spin_unlock in check_ref_cleanup Btrfs: setup inode location during btrfs_init_inode_locked Btrfs: don't use ram_bytes for uncompressed inline items Btrfs: fix btrfs_search_slot_for_read backwards iteration Btrfs: do not export ulist functions Btrfs: rework ulist with list+rb_tree Btrfs: fix memory leaks on walking backrefs failure Btrfs: fix send file hole detection leading to data corruption Btrfs: add a reschedule point in btrfs_find_all_roots() Btrfs: make send's file extent item search more efficient Btrfs: fix to catch all errors when resolving indirect ref Btrfs: fix protection between walking backrefs and root deletion btrfs: fix warning while merging two adjacent extents Btrfs: fix infinite path build loops in incremental send btrfs: undo sysfs when open_ctree() fails Btrfs: fix snprintf usage by send's gen_unique_name btrfs: fix defrag 32-bit integer overflow btrfs: sysfs: list the NO_HOLES feature btrfs: sysfs: don't show reserved incompat feature btrfs: call permission checks earlier in ioctls and return EPERM ...
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-29Btrfs: fix spin_unlock in check_ref_cleanupChris Mason
Our goto out should have gone a little farther. Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: setup inode location during btrfs_init_inode_lockedChris Mason
We have a race during inode init because the BTRFS_I(inode)->location is setup after the inode hash table lock is dropped. btrfs_find_actor uses the location field, so our search might not find an existing inode in the hash table if we race with the inode init code. This commit changes things to setup the location field sooner. Also the find actor now uses only the location objectid to match inodes. For inode hashing, we just need a unique and stable test, it doesn't have to reflect the inode numbers we show to userland. Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
2014-01-29Btrfs: don't use ram_bytes for uncompressed inline itemsChris Mason
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
2014-01-29Btrfs: fix btrfs_search_slot_for_read backwards iterationFilipe David Borba Manana
If the current path's leaf slot is 0, we do search for the previous leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a value corresponding to the number of items - 1 of the former leaf. Fix this by using the slot set by btrfs_prev_leaf, decrementing it by 1 if it's equal to the leaf's number of items. Use of btrfs_search_slot_for_read() for backward iteration is used in particular by the send feature, which could miss items when the input leaf has less items than its previous leaf. This could be reproduced by running btrfs/007 from xfstests in a loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: do not export ulist functionsWang Shilong
There are not any users that use ulist except Btrfs,don't export them. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: rework ulist with list+rb_treeWang Shilong
We are really suffering from now ulist's implementation, some developers gave their try, and i just gave some of my ideas for things: 1. use list+rb_tree instead of arrary+rb_tree 2. add cur_list to iterator rather than ulist structure. 3. add seqnum into every node when they are added, this is used to do selfcheck when iterating node. I noticed Zach Brown's comments before, long term is to kick off ulist implementation, however, for now, we need at least avoid arrary from ulist. Cc: Liu Bo <bo.li.liu@oracle.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: fix memory leaks on walking backrefs failureWang Shilong
When walking backrefs, we may iterate every inode's extent and add/merge them into ulist, and the caller will free memory from ulist. However, if we fail to allocate inode's extents element memory or ulist_add() fail to allocate memory, we won't add allocated memory into ulist, and the caller won't free some allocated memory thus memory leaks happen. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: fix send file hole detection leading to data corruptionFilipe David Borba Manana
There was a case where file hole detection was incorrect and it would cause an incremental send to override a section of a file with zeroes. This happened in the case where between the last leaf we processed which contained a file extent item for our current inode and the leaf we're currently are at (and has a file extent item for our current inode) there are only leafs containing exclusively file extent items for our current inode, and none of them was updated since the previous send operation. The file hole detection code would incorrectly consider the file range covered by these leafs as a hole. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: add a reschedule point in btrfs_find_all_roots()Wang Shilong
I can easily trigger the following warnings when enabling quota in my virtual machine(running Opensuse), Steps are firstly creating a subvolume full of fragment extents, and then create many snapshots (500 in my test case). [ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970] [ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000 [ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0 [ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs] [ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0 [ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c [ 2362.809054] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0 [ 2362.809099] Stack: [ 2362.809100] 00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001 [ 2362.809105] 99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001 [ 2362.809109] f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023 [ 2362.809113] Call Trace: [ 2362.809136] [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs] [ 2362.809156] [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs] [ 2362.809174] [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs] [ 2362.809180] [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40 [ 2362.809199] [<fa3648df>] worker_loop+0xff/0x470 [btrfs] [ 2362.809204] [<c027a88a>] ? __wake_up_locked+0x1a/0x20 [ 2362.809221] [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs] [ 2362.809225] [<c025ebbc>] kthread+0x9c/0xb0 [ 2362.809229] [<c06b487b>] ret_from_kernel_thread+0x1b/0x30 [ 2362.809233] [<c025eb20>] ? kthread_create_on_node+0x110/0x110 By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer hit these warnings. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: make send's file extent item search more efficientFilipe David Borba Manana
Instead of looking for a file extent item, process it, release the path and do a btree search for the next file extent item, just process all file extent items in a leaf without intermediate btree searches. This way we save cpu and we're not blocking other tasks or affecting concurrency on the btree, because send's paths use the commit root and skip btree node/leaf locking. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: fix to catch all errors when resolving indirect refWang Shilong
We can only tolerate ENOENT here, for other errors, we should return directly. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: fix protection between walking backrefs and root deletionWang Shilong
There is a race condition between resolving indirect ref and root deletion, and we should gurantee that root can not be destroyed to avoid accessing broken tree here. Here we fix it by holding @subvol_srcu, and we will release it as soon as we have held root node lock. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29btrfs: fix warning while merging two adjacent extentsGui Hecheng
When we have two adjacent extents in relink_extent_backref, we try to merge them. When we use btrfs_search_slot to locate the slot for the current extent, we shouldn't set "ins_len = 1", because we will merge it into the previous extent rather than insert a new item. Otherwise, we may happen to create a new leaf in btrfs_search_slot and path->slot[0] will be 0. Then we try to fetch the previous item using "path->slots[0]--", and it will cause a warning as follows: [ 145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0 [ 145.713387] btrfs bad mapping eb start 5337088 len 4096, wanted 167772306 8 ... [ 145.713462] [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0 [ 145.713476] [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40 [ 145.713485] [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0 [ 145.713498] [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820 [ 145.713508] [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80 I encounter this warning when running defrag having mkfs.btrfs with option -M. At the same time there are read/writes & snapshots running at background. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29Btrfs: fix infinite path build loops in incremental sendFilipe David Borba Manana
The send operation processes inodes by their ascending number, and assumes that any rename/move operation can be successfully performed (sent to the caller) once all previous inodes (those with a smaller inode number than the one we're currently processing) were processed. This is not true when an incremental send had to process an hierarchical change between 2 snapshots where the parent-children relationship between directory inodes was reversed - that is, parents became children and children became parents. This situation made the path building code go into an infinite loop, which kept allocating more and more memory that eventually lead to a krealloc warning being displayed in dmesg: WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0() Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007 [ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0 [ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0 [ 5381.660457] Call Trace: [ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68 [ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0 [ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20 [ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0 [ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440 [ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0 [ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50 [ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100 [ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230 [ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe [ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0 [ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs] [ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs] [ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs] Even without this loop, the incremental send couldn't succeed, because it would attempt to send a rename/move operation for the lower inode before the highest inode number was renamed/move. This issue is easy to trigger with the following steps: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ mkdir /mnt/btrfs/a/b/c2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send The structure of the filesystem when the first snapshot is taken is: . (ino 256) |-- a (ino 257) |-- b (ino 258) |-- c (ino 259) | |-- d (ino 260) | |-- c2 (ino 261) And its structure when the second snapshot is taken is: . (ino 256) |-- a (ino 257) |-- b (ino 258) |-- c2 (ino 261) |-- d2 (ino 260) |-- cc (ino 259) Before the move/rename operation is performed for the inode 259, the move/rename for inode 260 must be performed, since 259 is now a child of 260. A test case for xfstests, with a more complex scenario, will follow soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: undo sysfs when open_ctree() failsAnand Jain
reproducer: mkfs.btrfs -f /dev/sdb &&\ mount /dev/sdb /btrfs &&\ btrfs dev add -f /dev/sdc /btrfs &&\ umount /btrfs &&\ wipefs -a /dev/sdc &&\ mount -o degraded /dev/sdb /btrfs //above mount fails so try with RO mount -o degraded,ro /dev/sdb /btrfs ------ sysfs: cannot create duplicate filename '/fs/btrfs/3f48c79e-5ed0-4e87-b189-86e749e503f4' :: dump_stack+0x49/0x5e warn_slowpath_common+0x87/0xb0 warn_slowpath_fmt+0x41/0x50 strlcat+0x69/0x80 sysfs_warn_dup+0x87/0xa0 sysfs_add_one+0x40/0x50 create_dir+0x76/0xc0 sysfs_create_dir_ns+0x7a/0xc0 kobject_add_internal+0xad/0x220 kobject_add_varg+0x38/0x60 kobject_init_and_add+0x53/0x70 mutex_lock+0x11/0x40 __free_pages+0x25/0x30 free_pages+0x41/0x50 selinux_sb_copy_data+0x14e/0x1e0 mount_fs+0x3e/0x1a0 vfs_kern_mount+0x71/0x120 do_mount+0x3f7/0x980 SyS_mount+0x8b/0xe0 system_call_fastpath+0x16/0x1b ------ further 'modprobe -r btrfs' fails as well Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix snprintf usage by send's gen_unique_nameFilipe David Borba Manana
The buffer size argument passed to snprintf must account for the trailing null byte added by snprintf, and it returns a value >= then sizeof(buffer) when the string can't fit in the buffer. Since our buffer has a size of 64 characters, and the maximum orphan name we can generate is 63 characters wide, we must pass 64 as the buffer size to snprintf, and not 63. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: fix defrag 32-bit integer overflowJustin Maggard
When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: sysfs: list the NO_HOLES featureDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: sysfs: don't show reserved incompat featureDavid Sterba
The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit is only reserved, no point to list it in sysfs. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: call permission checks earlier in ioctls and return EPERMDavid Sterba
The owner and capability checks in IOC_SUBVOL_SETFLAGS and SET_RECEIVED_SUBVOL should be called before any other checks are done. Also unify the error code to EPERM. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28btrfs: restrict snapshotting to own subvolumesDavid Sterba
Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix wrong block group in trace during the free space allocationMiao Xie
We allocate the free space from the former block group, not the current one, so should use the former one to output the trace information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: cleanup the code of used_block_group in find_free_extent()Miao Xie
used_block_group is just used for the space cluster which doesn't belong to the current block group, the other place needn't use it. Or the logic of code seems unclear. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: cleanup the redundant code for the block group allocation and initMiao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: change the members' order of btrfs_space_info structure to reduce the ↵Miao Xie
cache miss It is better that the position of the lock is close to the data which is protected by it, because they may be in the same cache line, we will load less cache lines when we access them. So we rearrange the members' position of btrfs_space_info structure to make the lock be closer to the its data. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix wrong search path initialization before searching tree rootWang Shilong
To search tree root without transaction protection, we should neither search commit root nor skip locking here, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: flush the dirty pages of the ordered extent aggressively during ↵Miao Xie
logging csum The performance of fsync dropped down suddenly sometimes, the main reason of this problem was that we might only flush part dirty pages in a ordered extent, then got that ordered extent, wait for the csum calcucation. But if no task flushed the left part, we would wait until the flusher flushed them, sometimes we need wait for several seconds, it made the performance drop down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s) This patch improves the above problem by flushing left dirty pages aggressively. Test Environment: CPU: 2CPU * 2Cores Memory: 4GB Partition: 20GB(HDD) Test Command: # sysbench --num-threads=8 --test=fileio --file-num=1 \ > --file-total-size=8G --file-block-size=32768 \ > --file-io-mode=sync --file-fsync-freq=100 \ > --file-fsync-end=no --max-requests=10000 \ > --file-test-mode=rndwr run Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix transaction abortion when remounting btrfs from RW to ROWang Shilong
Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt -o flushoncommit # dd if=/dev/zero of=/mnt/data bs=4k count=102400 & # mount /dev/sda8 /mnt -o remount, ro When remounting RW to RO, the logic is to firstly set flag to RO and then commit transaction, however with option flushoncommit enabled,we will do RO check within committing transaction, so we get a transaction abortion here. Actually,here check is wrong, we should check if FS_STATE_ERROR is set, fix it. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: faster file extent item search in clone ioctlFilipe David Borba Manana
When we are looking for file extent items that intersect the cloning range, for each one that falls completely outside the range, don't release the path and do another full tree search - just move on to the next slot and copy the file extent item into our buffer only if the item intersects the cloning range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28Btrfs: fix extent state leak on transaction abortionLiu Bo
When transaction is aborted, we fail to commit transaction, instead we do cleanup work. After that when we umount btrfs, we get to free fs roots' log trees respectively, but that happens after we unpin extents, so those extents pinned by freeing log trees will remain in memory and lead to the leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>