summaryrefslogtreecommitdiff
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)Author
2015-04-13btrfs: Fix NO_SPACE bug caused by delayed-iputZhao Lei
Steps to reproduce: while true; do dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%] rm /btrfs_dir/file sync done And we'll see dd failed because btrfs return NO_SPACE. Reason: Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs() in end to free fs space for next write, but sometimes it hadn't done work on time, because btrfs-cleaner thread get delayed-iputs from list before, but do iput() after next write. This is log: [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs() [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs() [ 2569.087554] comm=sync func=btrfs_commit_transaction() end [ 2569.191081] comm=dd begin [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28 [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888 [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512 ... [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496 [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end Fix: Make btrfs_commit_transaction() wait current running btrfs-cleaner's delayed-iputs() done in end. Test: Use script similar to above(more complex), before patch: 7 failed in 100 * 20 loop. after patch: 0 failed in 100 * 20 loop. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: fix use after free when close_ctree frees the orphan_rsvChris Mason
Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed. This adds a new __ function that directly calls kfree instead of trying to update the space infos. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: allow block group cache writeout outside critical section in commitChris Mason
We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason <clm@fb.com>
2015-03-27Btrfs: Remove the check for old-style mkfsLiu Bo
This was used to make sure that a fresh btrfs from an older mkfs.btrfs, but it also allows us to mount a buggy btrfs if this btrfs has the right superblock head part but has something wrong with chunk tree part[1], and after that we can hit BUG_ON()s set in the code to prevent something impossible. Since David has released "Btrfs progs v3.19-rc2", just remove the check, if anyone who wants to make a fresh btrfs, please use the latest one. [1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-25Merge branch 'cleanups-post-3.19' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 Signed-off-by: Chris Mason <clm@fb.com> Conflicts: fs/btrfs/disk-io.c
2015-03-25Merge branch 'cleanups-for-4.1-v2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
2015-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Most of these are fixing extent reservation accounting, or corners with tree writeback during commit. Josef's set does add a test, which isn't strictly a fix, but it'll keep us from making this same mistake again" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix outstanding_extents accounting in DIO Btrfs: add sanity test for outstanding_extents accounting Btrfs: just free dummy extent buffers Btrfs: account merges/splits properly Btrfs: prepare block group cache before writing Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) Btrfs: account for the correct number of extents for delalloc reservations Btrfs: fix merge delalloc logic Btrfs: fix comp_oper to get right order Btrfs: catch transaction abortion after waiting for it btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-13btrfs: fix sizeof format specifier in btrfs_check_super_valid()Fabian Frederick
This patch fixes mips compilation warning: fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid': fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat] Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-03btrfs: cleanup, use kmalloc_array/kcalloc array helpersDavid Sterba
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional overflow checks, the zeroing variant is kcalloc. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup 64bit/32bit divs, compile time constantsDavid Sterba
Switch to div_u64 if the divisor is a numeric constant or sum of sizeof()s. We can remove a few instances of do_div that has the hidden semtantics of changing the 1st argument. Small power-of-two divisors are converted to bitshifts, large values are kept intact for clarity. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...
2015-02-16btrfs: cleanup, reduce temporary variables in btrfs_read_rootsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: use correct type for workqueue flagsDavid Sterba
Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key takes an unsigned int. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_read_roots() out of open_ctree()Eric Sandeen
Also, remove the two local variables create_uuid_tree and check_uuid_tree; we can use the existence of the uuid root and/or the RESCAN_UUID_TREE flag to determine what action to take. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_replay_log() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_workqueues() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_workqueues] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_qgroup() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_qgroup] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_dev_replace_locks() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_dev_replace_locks] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_btree_inode() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_btree_inode] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_balance() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_balance] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: factor btrfs_init_scrub() out of open_ctree()Eric Sandeen
Signed-off-by: Eric Sandeen <sandeen@redhat.com> [renamed to btrfs_init_scrub] Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: consistently use fs_info in close_ctree()Eric Sandeen
close_ctree() has a local fs_info var for convienience; use it consistently. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: remove unused fs_info arg from btrfs_close_extra_devices()Eric Sandeen
The commit: 8dabb74 Btrfs: change core code of btrfs to support the device replace operations added the fs_info argument, but never used it - just remove it again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: fix sizeof format specifier in btrfs_check_super_valid()Fabian Frederick
This patch fixes mips compilation warning: fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid': fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat] Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16btrfs: constify structs with op functions or static definitionsDavid Sterba
There are some op tables that can be easily made const, similarly the sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16Btrfs: disk-io: replace root args iff only fs_info usedDaniel Dressler
This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-03Btrfs: fix race between transaction commit and empty block group removalFilipe Manana
Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpining the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs() This race got introduced with the change: Btrfs: remove empty block groups automatically commit 47ab2a6c689913db23ccae38349714edf8365e0a Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-03btrfs: add checks for sys_chunk_array sizesDavid Sterba
Verify that possible minimum and maximum size is set, validity of contents is checked in btrfs_read_sys_array. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-03btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesizeDavid Sterba
I received a few crafted images from Jiri, all got through the recently added superblock checks. The lower bounds checks for num_devices and sector/node -sizes were missing and caused a crash during mount. Tools for symbolic code execution were used to prepare the images contents. Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22Btrfs: fix unused members in struct btrfs_rootAnand Jain
There isn't any real use of following members of struct btrfs_root so delete them. struct kobject root_kobj; struct completion kobj_unregister; Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22btrfs: set proper message level for skinny metadataDavid Sterba
This has been confusing people for too long, the message is really just informative. CC: <stable@vger.kernel.org> # 3.10+ Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22btrfs: update message levels after checksum errorsDavid Sterba
The errors are worth noting and might get missed with INFO level. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22btrfs: update message levels during failed mountDavid Sterba
All error conditions from open_ctree shall be ERR. Warning would suggest that something's wrong and we can continue. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22btrfs: update message levels for errorsDavid Sterba
Several messages that point to some internal problem, level INFO is wrong here. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22Merge branch 'cleanup/blocksize-diet-part2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-20fs: remove default_backing_dev_infoChristoph Hellwig
Now that default_backing_dev_info is not used for writeback purposes we can git rid of it easily: - instead of using it's name for tracing unregistered bdi we just use "unknown" - btrfs and ceph can just assign the default read ahead window themselves like several other filesystems already do. - we can assign noop_backing_dev_info as the default one in alloc_super. All filesystems already either assigned their own or noop_backing_dev_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20fs: remove mapping->backing_dev_infoChristoph Hellwig
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20fs: introduce f_op->mmap_capabilities for nommu mmap supportChristoph Hellwig
Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-14btrfs: expand btrfs_find_item if found_key is NULLDavid Sterba
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper for simple btrfs_search_slot. After we've removed all such callers, passing a NULL key is not valid anymore. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14btrfs: fix leak of path in btrfs_find_itemDavid Sterba
If btrfs_find_item is called with NULL path it allocates one locally but does not free it. Affected paths are inserting an orphan item for a file and for a subvol root. Move the path allocation to the callers. CC: <stable@vger.kernel.org> # 3.14+ Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality") Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink parameter len to alloc_extent_bufferDavid Sterba
Because we're using globally known nodesize. Do the same for the sanity test function variant. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink blocksize parameter to btrfs_find_create_tree_blockDavid Sterba
Finally it's clear that the requested blocksize is always equal to nodesize, with one exception, the superblock. Superblock has fixed size regardless of the metadata block size, but uses the same helpers to initialize sys array/chunk tree and to work with the chunk items. So it pretends to be an extent_buffer for a moment, btrfs_read_sys_array is full of special cases, we're adding one more. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink blocksize parameter to reada_tree_block_flaggedDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink blocksize parameter to readahead_tree_blockDavid Sterba
All callers pass nodesize. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-10Btrfs: fix fs corruption on transaction abort if device supports discardFilipe Manana
When we abort a transaction we iterate over all the ranges marked as dirty in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them from those trees, add them back (unpin) to the free space caches and, if the fs was mounted with "-o discard", perform a discard on those regions. Also, after adding the regions to the free space caches, a fitrim ioctl call can see those ranges in a block group's free space cache and perform a discard on the ranges, so the same issue can happen without "-o discard" as well. This causes corruption, affecting one or multiple btree nodes (in the worst case leaving the fs unmountable) because some of those ranges (the ones in the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are referred by the last committed super block - breaking the rule that anything that was committed by a transaction is untouched until the next transaction commits successfully. I ran into this while running in a loop (for several hours) the fstest that I recently submitted: [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim The corruption always happened when a transaction aborted and then fsck complained like this: _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent *** fsck.btrfs output *** Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 read block failed check_tree_block Couldn't open file system In this case 94945280 corresponded to the root of a tree. Using frace what I observed was the following sequence of steps happened: 1) transaction N started, fs_info->pinned_extents pointed to fs_info->freed_extents[0]; 2) node/eb 94945280 is created; 3) eb is persisted to disk; 4) transaction N commit starts, fs_info->pinned_extents now points to fs_info->freed_extents[1], and transaction N completes; 5) transaction N + 1 starts; 6) eb is COWed, and btrfs_free_tree_block() called for this eb; 7) eb range (94945280 to 94945280 + 16Kb) is added to fs_info->pinned_extents (fs_info->freed_extents[1]); 8) Something goes wrong in transaction N + 1, like hitting ENOSPC for example, and the transaction is aborted, turning the fs into readonly mode. The stack trace I got for example: [112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66 [112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98 [112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs] [112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50 [112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs] [112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs] [112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs] [112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs] [112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs] (...) [112065.264493] ---[ end trace dd7903a975a31a08 ]--- [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left [112065.264997] BTRFS info (device sdc): forced readonly 9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in fs_info->fs_state and calls btrfs_cleanup_transaction(), which in turn calls btrfs_destroy_pinned_extent(); 10) Then btrfs_destroy_pinned_extent() iterates over all the ranges marked as dirty in fs_info->freed_extents[], and for each one it calls discard, if the fs was mounted with "-o discard", and adds the range to the free space cache of the respective block group; 11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path, sees the free space entries and performs a discard; 12) After an umount and mount (or fsck), our eb's location on disk was full of zeroes, and it should have been untouched, because it was marked as dirty in the fs_info->pinned_extents tree, and therefore used by the trees that the last committed superblock points to. Fix this by not performing a discard and not adding the ranges to the free space caches - it's useless from this point since the fs is now in readonly mode and we won't write free space caches to disk anymore (otherwise we would leak space) nor any new superblock. By not adding the ranges to the free space caches, it prevents other code paths from allocating that space and write to it as well, therefore being safer and simpler. This isn't a new problem, as it's been present since 2011 (git commit acce952b0263825da32cf10489413dec78053347). Cc: stable@vger.kernel.org # any kernel released after 2011-01-06 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-03Btrfs: fix race between fs trimming and block group remove/allocationFilipe Manana
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25Merge branch 'dev/pending-changes' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2014-11-21Btrfs: make sure logged extents complete in the current transaction V3Josef Bacik
Liu Bo pointed out that my previous fix would lose the generation update in the scenario I described. It is actually much worse than that, we could lose the entire extent if we lose power right after the transaction commits. Consider the following write extent 0-4k log extent in log tree commit transaction < power fail happens here ordered extent completes We would lose the 0-4k extent because it hasn't updated the actual fs tree, and the transaction commit will reset the log so it isn't replayed. If we lose power before the transaction commit we are save, otherwise we are not. Fix this by keeping track of all extents we logged in this transaction. Then when we go to commit the transaction make sure we wait for all of those ordered extents to complete before proceeding. This will make sure that if we lose power after the transaction commit we still have our data. This also fixes the problem of the improperly updated extent generation. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-11-21btrfs: fix typos in btrfs_check_super_validDavid Sterba
Copy&paste errors in some messages and add few more missing macro accessors. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-11-12btrfs: switch inode_cache option handling to pending changesDavid Sterba
The pending mount option(s) now share namespace and bits with the normal options, and the existing one for (inode_cache) is unset unconditionally at each transaction commit. Introduce a separate namespace for pending changes and enhance the descriptions of the intended change to use separate bits for each action. Signed-off-by: David Sterba <dsterba@suse.cz>