path: root/fs/btrfs/extent-tree.c
Age  Commit message  Author
2013-09-01  Btrfs: return ENOSPC when target space is full  (Filipe David Borba Manana)
In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success) when the target space was full and a chunk allocation was needed. However, later on in that same function we return ENOSPC if btrfs_alloc_chunk() fails (and chunk allocation was needed) and set the space's full flag. This was inconsistent, as -ENOSPC should be returned if the space is full and a chunk allocation needs to be performed. If the space is full but no chunk allocation is needed, just return 0 (success). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
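A minimal standalone sketch of the consistent behaviour described above (illustrative names and a reduced signature, not the kernel's actual do_chunk_alloc()):

    /* Illustrative model only: if the space is full and a chunk allocation
     * would be needed, fail with -ENOSPC; if it is full but no allocation
     * is needed, succeed. */
    #include <errno.h>
    #include <stdio.h>

    struct space_model {
        int full;                       /* the target space is marked full */
    };

    static int do_chunk_alloc_model(struct space_model *s, int alloc_needed)
    {
        if (s->full)
            return alloc_needed ? -ENOSPC : 0;
        /* ... otherwise try to allocate a chunk; on failure the real code
         *     sets the full flag and returns -ENOSPC ... */
        return 0;
    }

    int main(void)
    {
        struct space_model s = { .full = 1 };
        printf("%d %d\n", do_chunk_alloc_model(&s, 1), do_chunk_alloc_model(&s, 0));
        return 0;
    }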
2013-09-01  Btrfs: handle errors when doing slow caching  (Josef Bacik)
Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait forever if a drive failed. This is because we just bail out if there is an error while trying to cache a block group, we don't update anybody who may be waiting. So this introduces a new enum for the cache state in case of error and makes everybody bail out if we have an error. Alex tested and verified this patch fixed his problem. This fixes bz 59431. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01  Btrfs: cleanup reloc roots properly on error  (Josef Bacik)
I was hitting the BUG_ON() at the end of merge_reloc_roots() because we were aborting the transaction at some point previously and then getting an error when we tried to drop the reloc root. I fixed btrfs_drop_snapshot to re-add us to the dead roots list if we failed, but this isn't the right thing to do for reloc roots since it uses root->root_list for its own stuff in order to know what needs to be cleaned up. So fix btrfs_drop_snapshot to only do the re-add if we aren't dropping for reloc, and handle errors from merge_reloc_root() by dropping the reloc root we are processing since it won't be on the list of roots to cleanup. With this patch my reproducer no longer panics. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01  Btrfs/tracepoint: update delayed ref tracepoints  (Liu Bo)
This shows exactly how btrfs processes the delayed refs onto disks, which is very helpful on understanding delayed ref mechanism and debugging related bugs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01  btrfs_read_block_groups: Use enums to index btrfs_space_info->block_groups  (chandan)
The current code uses integer literals to index the btrfs_space_info->block_groups[] array. Instead, use the corresponding enums from 'enum btrfs_raid_types'. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
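As an illustration of the pattern (the enumerator names below are simplified assumptions, not copied from the kernel headers):

    /* Indexing the per-raid-type array by enum rather than by magic numbers. */
    #include <stdio.h>

    enum raid_type_model {
        RAID_TYPE_RAID10,
        RAID_TYPE_RAID1,
        RAID_TYPE_RAID0,
        RAID_TYPE_SINGLE,
        NR_RAID_TYPES
    };

    int main(void)
    {
        int block_groups[NR_RAID_TYPES] = { 0 };

        block_groups[RAID_TYPE_RAID1]++;   /* clearer than block_groups[1]++ */
        printf("raid1 groups: %d\n", block_groups[RAID_TYPE_RAID1]);
        return 0;
    }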
2013-09-01  Btrfs: make free space caching faster with many non-inline extent references  (Liu Bo)
To cache free space, we iterate every extent item to gather free space info. When we have, say, 10,000 non-inline extent refs (such as BTRFS_EXTENT_DATA_REF), it takes quite a long time, and since inline extent refs and non-inline ones have the same objectid in their keys, we can just re-search the tree with the next address to skip the non-inline references. (This was found via the dedup feature, because dedup extents can end up with many non-inline extent refs.) Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01  btrfs: fall back to global reservation when removing subvolumes  (Jeff Mahoney)
I recently did some ENOSPC testing that involved filling the disk while creating and removing snapshots in a loop. During the test cycle, I ran into an ENOSPC when trying to remove a snapshot, leaving the fs stuck in ENOSPC even after a umount/mount cycle. This patch allows subvolume removal to fall back onto the global block reservation in order to succeed when it would have failed otherwise. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01  Btrfs: optimize btrfs_lookup_extent_info()  (Filipe David Borba Manana)
If we're looking for a metadata item in the tree and the search fails with a return value of 1, and the slot doesn't point to the first item in the leaf, check if the previous item in the leaf corresponds to an extent item for the same object id - if it does, then don't do another tree search to get it. This optimization is already done by btrfs-progs. V2: updated commit message. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
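The idea, sketched over a plain array standing in for a btree leaf (illustrative only; the real code inspects the previous slot of the btrfs leaf for an extent item with the same objectid):

    /* After a search lands one past the wanted key, peek at the previous
     * slot before paying for a second full tree search. */
    #include <stdio.h>

    struct key_model { unsigned long long objectid; int type; };

    /* returns the previous slot if it matches, else -1 (caller re-searches) */
    static int check_prev_slot(const struct key_model *leaf, int slot,
                               unsigned long long objectid, int wanted_type)
    {
        if (slot > 0 &&
            leaf[slot - 1].objectid == objectid &&
            leaf[slot - 1].type == wanted_type)
            return slot - 1;
        return -1;
    }

    int main(void)
    {
        struct key_model leaf[] = { { 100, 1 }, { 200, 1 } };
        printf("%d\n", check_prev_slot(leaf, 1, 100, 1));   /* prints 0 */
        return 0;
    }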
2013-09-01  Btrfs: set lockdep class before locking new extent buffer  (Josef Bacik)
We've been seeing spurious complaints out of lockdep because the lock class name changes. This is happening because when we drop a snapshot we will lock a block before we've read it in, which sets the lockdep class to whatever the default is. Then once we read the thing in we reset the lockdep class to what it is supposed to be, which blows lockdep's mind. This patch should fix the problem; it appears to be the only place where we do this sort of thing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-07-19  Btrfs: re-add root to dead root list if we stop dropping it  (Josef Bacik)
If we stop dropping a root for whatever reason we need to add it back to the dead root list so that we will re-start the dropping next transaction commit. The other case this happens is if we recover a drop, because we will add a root without adding it to the fs radix tree, so we can leak its root and commit root extent buffer; adding this to the dead root list makes this cleanup happen. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19  Btrfs: fix lock leak when resuming snapshot deletion  (Josef Bacik)
We aren't setting path->locks[level] when we resume a snapshot deletion which means we won't unlock the buffer when we free the path. This causes deadlocks if we happen to re-allocate the block before we've evicted the extent buffer from cache. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19  Btrfs: update drop progress before stopping snapshot dropping  (Josef Bacik)
Alex pointed out a problem, and a fix, that exists in the drop-one-snapshot-at-a-time patch. If we decide we need to exit for whatever reason (umount, for example) we will just exit the snapshot dropping without updating the drop progress. So the next time we go to resume we will BUG_ON() because we can't find the extent we left off at, because we never updated it. This patch fixes the problem. Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02  Btrfs: make the chunk allocator completely tree lockless  (Josef Bacik)
When adjusting the enospc rules for relocation I ran into a deadlock because we were relocating the only system chunk and that forced us to try and allocate a new system chunk while holding locks in the chunk tree, which caused us to deadlock. To fix this I've moved all of the dev extent addition and chunk addition out to the delayed chunk completion stuff. We still keep the in-memory stuff which makes sure everything is consistent. One change I had to make was to search the commit root of the device tree to find a free dev extent, and hold onto any chunk em's that we allocated in that transaction so we do not allocate the same dev extent twice. This has the side effect of fixing a bug with balance that has been there ever since balance existed. Basically you can free a block group and its dev extent and then immediately allocate that dev extent for a new block group and write stuff to that dev extent, all within the same transaction. So if you happen to crash during a balance you could come back to a completely broken file system. This patch should keep these sorts of things from happening in the future since we won't be able to allocate free'd dev extents until after the transaction commits. This has passed all of the xfstests and my super annoying stress test followed by a balance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02  Btrfs: check if we can nocow if we don't have data space  (Josef Bacik)
We always just try and reserve data space when we write, but if we are out of space yet have prealloc'ed extents we should still be able to write successfully. This patch will try and see if we can write to prealloc'ed space and, if we can, go ahead and allow the write to continue. With this patch we now pass xfstests generic/274. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02  Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc  (Josef Bacik)
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which is completely fraking useless for us as we need to make sure pages are actually written before we go and check if there are ordered extents. So replace this with an open coding of try_to_writeback_inodes_sb_nr minus the writeback underway check so that we are sure to actually have flushed some dirty pages out and will have ordered extents to use. With this patch xfstests generic/273 now passes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02  Btrfs: use a percpu to keep track of possibly pinned bytes  (Josef Bacik)
There are all of these checks in the ENOSPC code to see if committing the transaction would free up enough space to make the allocation. This is because early on we just committed the transaction and hoped and prayed, which resulted in cases where it took _forever_ to get an ENOSPC when we really were out of space. So we check space_info->bytes_pinned, except this isn't completely true because it doesn't account for space we may free but that is stuck in delayed refs. So tests like xfstests 226 would fail because we wouldn't commit the transaction to free up the data space. So instead add a percpu counter that will be a little fuzzier: it will add bytes as soon as we try to free up the space, and remove any space it doesn't actually free up when we get around to doing the actual free. We then 0 out this counter every transaction period so we have a better idea of how much space we will actually free up by committing this transaction. With this patch we now pass xfstests 226. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
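A rough userspace model of the idea (the kernel uses a percpu counter; a plain counter stands in here, and all names are illustrative):

    /* Track bytes we *may* free ("possibly pinned"): add when we queue a
     * free, subtract what we did not actually free, reset each transaction. */
    #include <stdio.h>

    static long long possibly_pinned;   /* deliberately fuzzier than bytes_pinned */

    static void will_maybe_free(long long bytes) { possibly_pinned += bytes; }
    static void did_not_free(long long bytes)    { possibly_pinned -= bytes; }
    static void transaction_commit(void)         { possibly_pinned = 0; }

    static int worth_committing(long long needed)
    {
        /* commit the transaction only if it could plausibly free enough */
        return possibly_pinned >= needed;
    }

    int main(void)
    {
        will_maybe_free(1 << 20);
        did_not_free(256 << 10);
        printf("commit? %d\n", worth_committing(512 << 10));
        transaction_commit();
        return 0;
    }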
2013-07-01  Btrfs: fix transaction throttling for delayed refs  (Josef Bacik)
Dave has this fs_mark script that can make btrfs abort with a sufficient amount of ram. This is because with more ram we can keep more dirty metadata in cache, which in a roundabout way makes for many more pending delayed refs. What happens is we end up not throttling the transaction enough, so when we go to commit the transaction when we've completely filled the file system we'll abort() because we use all of the space in the global reserve and we still have delayed refs to run. To fix this we need to make the delayed ref flushing and the transaction throttling dependent upon the number of delayed refs that we have instead of how much reserved space is left in the global reserve. With this patch we not only stop aborting transactions but we also get a smoother run speed with fs_mark, and it makes us about 10% faster. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01  Btrfs: wake up delayed ref flushing waiters on abort  (Josef Bacik)
I hit a deadlock because we aborted when flushing delayed refs but didn't wake any of the other flushers up and so everybody was just sleeping forever. This should fix the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: exclude logged extents before replaying when we are mixed  (Josef Bacik)
With non-mixed block groups we replay the logs before we're allowed to do any writes, so we get away with not pinning/removing the data extents until right when we replay them. However with mixed block groups we allocate out of the same pool, so we could easily allocate a metadata block that was logged in our tree log. To deal with this we just need to notice that we have mixed block groups and do the normal excluding/removal dance during the pin stage of the log replay, and that way we don't allocate metadata blocks from areas where we have logged data extents. With this patch we now pass xfstests generic/311 with mixed block groups turned on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: simplify unlink reservations  (Josef Bacik)
Dave pointed out a problem where if you filled up a file system as much as possible you couldn't remove any files. The whole unlink reservation thing is convoluted because it tries to guess if it's going to add space to unlink something or not, and has all these odd uncommented cases where it simply does not try. So to fix this I've added a way to conditionally steal from the global reserve if we can't make our normal reservation. If we have more than half the space in the global reserve free we will go ahead and steal from the global reserve. With this patch Dave's reproducer now works and I can rm all the files on the file system. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
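A hedged sketch of the "steal only if the global reserve is at least half free" rule described above (names and the reservation bookkeeping are illustrative placeholders, not the kernel's block_rsv API):

    /* Fall back to the global reserve for unlink only when at least half of
     * it is unused, so routine unlinks cannot drain it. */
    #include <stdio.h>

    struct block_rsv_model { long long size, reserved; };

    static int try_steal_from_global(struct block_rsv_model *global,
                                     long long bytes)
    {
        long long free_bytes = global->size - global->reserved;

        if (free_bytes * 2 < global->size)
            return 0;                     /* less than half free: don't steal */
        if (free_bytes < bytes)
            return 0;                     /* not enough room anyway */
        global->reserved += bytes;        /* take the reservation */
        return 1;
    }

    int main(void)
    {
        struct block_rsv_model g = { .size = 512 << 20, .reserved = 100 << 20 };
        printf("stole: %d\n", try_steal_from_global(&g, 4096));
        return 0;
    }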
2013-06-14  Btrfs: introduce per-subvolume ordered extent list  (Miao Xie)
The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: introduce per-subvolume delalloc inode list  (Miao Xie)
When we create a snapshot, we currently flush all delalloc inodes in the fs, but just flushing the inodes in the source tree is enough. So we introduce a per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: introduce grab/put functions for the root of the fs/file tree  (Miao Xie)
The grab/put functions will be used in the next patch, which needs to grab the root object and ensure it is not freed. We use a reference counter instead of the srcu lock to avoid blocking the memory reclaim task, which invokes synchronize_srcu(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: cleanup the similar code of the fs root read  (Miao Xie)
There are several functions whose code is similar, such as btrfs_find_last_root() and btrfs_read_fs_root_no_radix(). Besides that, some functions are invoked twice unnecessarily; for example, we are sure that all roots found in btrfs_find_orphan_roots() have their orphan items, so it is unnecessary to check the orphan item again. So clean it up. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14  Btrfs: make the snap/subv deletion end more early when the fs is R/O  (Miao Xie)
The snapshot/subvolume deletion might take a long time, which would make a remount task wait for a long time. This patch improves the situation: we break out of the deletion if the fs is remounted to be R/O. It will make the users happy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-18  Btrfs: explicitly use global_block_rsv for quota_tree  (Stefan Behrens)
The quota_tree was previously set up to use the empty_block_rsv, which would be problematic when the filesystem is filled up and ENOSPC happens during internal operations while the quota tree is updated and COWed (when the btrfs_qgroup_info_item items are written). In fact, use_block_rsv(), which is used in btrfs_cow_block(), falls back to the global_block_rsv in this case. But just in order to make it more clear what is happening, change it to explicitly use the global_block_rsv. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-18  Btrfs: update the global reserve if it is empty  (Miao Xie)
Before applying this patch, we reserved space for the global reserve by the minimum unit if we found it was empty. That was unreasonable and inefficient, because if the global reserve space was depleted, it implied that the size of the global reserve was too small. In this case, we should update the global reserve and fill it. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-18  Btrfs: don't steal the reserved space from the global reserve if their space type is different  (Miao Xie)
If the type of the space we need is different from that of the global reserve, we cannot steal space from the global reserve, because we cannot allocate space from the free space cache that the global reserve points to. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
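A one-condition sketch of that rule (the flag values are illustrative, not btrfs' space flags):

    /* Refuse to steal from the global reserve when the requested space type
     * (data vs. metadata) differs from the type the reserve points to. */
    #include <stdio.h>

    #define SPACE_DATA      (1u << 0)    /* illustrative flag values */
    #define SPACE_METADATA  (1u << 1)

    static int can_steal_from_global(unsigned needed_flags, unsigned rsv_flags)
    {
        return (needed_flags & rsv_flags) != 0;
    }

    int main(void)
    {
        printf("%d %d\n",
               can_steal_from_global(SPACE_METADATA, SPACE_METADATA),
               can_steal_from_global(SPACE_DATA, SPACE_METADATA));
        return 0;
    }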
2013-05-18  Btrfs: optimize the error handle of use_block_rsv()  (Miao Xie)
cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-18  Btrfs: don't use global block reservation for inode cache truncation  (Miao Xie)
It is very likely that there are lots of subvolumes/snapshots in the filesystem, so if we use the global block reservation to do inode cache truncation, we may hog all the free space that is reserved in the global rsv. So it is better that we do the free space reservation for inode cache truncation ourselves. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-18  Btrfs: handle running extent ops with skinny metadata  (Josef Bacik)
Chris hit a bug where we weren't finding extent records when running extent ops. This is because we use the delayed_ref_head when running the extent op, which means we can't use the ->type checks to see if we are metadata. We also lose the level of the metadata we are working on. So to fix this we can just check the ->is_data section of the extent_op, and we can store the level of the buffer we were modifying in the extent_op. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
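A simplified sketch of what the commit describes: the extent op itself records whether it is data and, for metadata, the level of the buffer (struct and field names are modeled on the description, not copied from the headers):

    /* When only the delayed_ref_head is available, ->type can no longer tell
     * us whether the ref is metadata, so carry that info in the extent op. */
    #include <stdio.h>

    struct extent_op_model {
        int is_data;        /* set for data extents */
        int level;          /* tree level of the buffer being modified */
    };

    static int is_metadata_op(const struct extent_op_model *op)
    {
        return !op->is_data;
    }

    int main(void)
    {
        struct extent_op_model op = { .is_data = 0, .level = 2 };
        if (is_metadata_op(&op))
            printf("metadata extent op at level %d\n", op.level);
        return 0;
    }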
2013-05-06  btrfs: fix misleading variable name for flags  (David Sterba)
The variable was named 'data' in btrfs_reserve_extent and that's the only function that actually uses it to let btrfs_get_alloc_profile know what profile we want. Then it's passed down as u64 flags. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  btrfs: make static code static & remove dead code  (Eric Sandeen)
Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: deal with free space cache errors while replaying log  (Josef Bacik)
So everybody who got hit by my fsync bug will still continue to hit this BUG_ON() in the free space cache, which is pretty heavy handed. So I took a file system that had this bug and fixed up all the BUG_ON()'s and leaks that popped up when I tried to mount a broken file system like this. With this patch we just fail to mount instead of panicking. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: allocate new chunks if the space is not enough for global rsv  (Miao Xie)
When running xfstests test 208, the fs returned the enospc error when there was lots of free space on the disk. By bisecting, we found it was introduced by commit 96f1bb5777. This commit makes the space check for the global reservation in can_overcommit() inconsistent with should_alloc_chunk(). can_overcommit() requires that the free space be 2 times the size of the global reservation, or we can't overcommit; instead, we need to reclaim some reserved space, and if we still don't have enough free space, we need to allocate a new chunk. But unfortunately, should_alloc_chunk() just requires that the free space be 1 times the size of the global reservation, that is, we would not try to allocate a new chunk if the free space size was between these two requirements, and would just return the enospc error. Fix it. Cc: Jim Schutt <jaschut@sandia.gov> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
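The mismatch in a few lines (a hedged model; the real checks involve more state than shown, and the "consistent" variant only illustrates the direction of the fix described above):

    /* can_overcommit() demanded free >= 2 * global_rsv before overcommitting,
     * while should_alloc_chunk() only triggered at free < 1 * global_rsv, so
     * free space between 1x and 2x produced ENOSPC with no chunk allocation. */
    #include <stdio.h>

    static int can_overcommit_model(long long free_bytes, long long global_rsv)
    {
        return free_bytes >= 2 * global_rsv;
    }

    static int should_alloc_chunk_old(long long free_bytes, long long global_rsv)
    {
        return free_bytes < global_rsv;          /* the inconsistent 1x threshold */
    }

    static int should_alloc_chunk_consistent(long long free_bytes, long long global_rsv)
    {
        return free_bytes < 2 * global_rsv;      /* matches the overcommit check */
    }

    int main(void)
    {
        long long rsv = 512 << 20, free_bytes = 700LL << 20;   /* in the gap */
        printf("overcommit=%d old_alloc=%d consistent_alloc=%d\n",
               can_overcommit_model(free_bytes, rsv),
               should_alloc_chunk_old(free_bytes, rsv),
               should_alloc_chunk_consistent(free_bytes, rsv));
        return 0;
    }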
2013-05-06  Btrfs: separate sequence numbers for delayed ref tracking and tree mod log  (Jan Schmidt)
Sequence numbers for delayed refs have been introduced in the first version of the qgroup patch set. To solve the problem of find_all_roots on a busy file system, the tree mod log was introduced. The sequence numbers for that were simply shared between those two users. However, at one point in qgroup's quota accounting, there's a statement accessing the previous sequence number, that's still just doing (seq - 1) just as it would have to in the very first version. To satisfy that requirement, this patch makes the sequence number counter 64 bit and splits it into a major part (used for qgroup sequence number counting) and a minor part (incremented for each tree modification in the log). This enables us to go exactly one major step backwards, as required for qgroups, while still incrementing the sequence counter for tree mod log insertions to keep track of their order. Keeping them in a single variable means there's no need to change all the code dealing with comparisons of two sequence numbers. The sequence number is reset to 0 on commit (not new in this patch), which ensures we won't overflow the two 32 bit counters. Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs from the tree mod log code may happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
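A compact sketch of splitting one 64-bit counter into a 32-bit major and 32-bit minor part (helper names and the exact arithmetic are illustrative, not the kernel's):

    /* Upper 32 bits: qgroup ("major") steps; lower 32 bits: tree mod log
     * insertions ("minor"). Keeping one u64 keeps comparisons trivial. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t seq_major(uint64_t seq) { return seq >> 32; }
    static uint64_t seq_minor(uint64_t seq) { return seq & 0xffffffffULL; }

    static uint64_t inc_minor(uint64_t seq) { return seq + 1; }
    static uint64_t inc_major(uint64_t seq) { return seq + (1ULL << 32); }

    int main(void)
    {
        uint64_t seq = 0;
        seq = inc_minor(inc_minor(seq));   /* two tree mod log insertions */
        seq = inc_major(seq);              /* one qgroup accounting step  */
        printf("major=%llu minor=%llu\n",
               (unsigned long long)seq_major(seq),
               (unsigned long long)seq_minor(seq));
        /* "one major step backwards" is then simply seq - (1ULL << 32) */
        return 0;
    }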
2013-05-06  Btrfs: don't panic if we're trying to drop too many refs  (Josef Bacik)
This is just obnoxious. Just print a message, abort the transaction, and return an error. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: fix all callers of read_tree_block  (Josef Bacik)
We kept leaking extent buffers when mounting a broken file system and it turns out it's because not everybody uses read_tree_block properly. You need to check and make sure the extent_buffer is uptodate before you use it. This patch fixes everybody who calls read_tree_block directly to make sure they check that it is uptodate and free it and return an error if it is not. With this we no longer leak EB's when things go horribly wrong. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: only exclude supers in the range of our block group  (Josef Bacik)
If we fail to load block groups halfway through we can leave extent_state's on the excluded tree. This is because we just lookup the supers and add them to the excluded tree regardless of which block group we are looking at currently. This is a problem because we remove the excluded extents for the range of the block group only, so if we don't ever load a block group for one of the excluded extents we won't ever free it. This fixes the problem by only adding excluded extents if it falls in the block group range we care about. With this patch we're no longer leaking space when we fail to read all of the block groups. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: fix possible infinite loop in slow caching  (Josef Bacik)
So I noticed there is an infinite loop in the slow caching code. We return 1 when we hit the end of the tree, so we could end up caching the last block group the slow way and suddenly we're looping forever because we just keep re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we don't need_resched(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: cleanup of function where btrfs_extend_item() is called  (Tsutomu Itoh)
The 'trans' argument became unnecessary in setup_inline_extent_backref(), which calls btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: remove unused argument of btrfs_extend_item()  (Tsutomu Itoh)
Argument 'trans' is not used in btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: cleanup of function where fixup_low_keys() is called  (Tsutomu Itoh)
Where the 'trans' argument is unnecessary in a function that calls fixup_low_keys(), 'trans' is removed. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: don't wait on ordered extents if we have a trans open  (Josef Bacik)
Dave was hitting a lockdep warning because we're now properly taking the ordered operations mutex in the ordered wait stuff. This is because some cases we will have a trans handle when we are flushing delalloc space, but we can't wait on ordered extents because we could potentially deadlock, so fix this by not doing the wait if we have a trans handle. Thanks Reported-and-tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: fix error handling in make/read block group  (Josef Bacik)
I noticed that we will add a block group to the space info before we add it to the block group cache rb tree, so we could potentially allocate from the block group before it's able to be searched for. I don't think this is too much of a problem, the race window is microscopic, but just in case move the tree insertion to above the space info linking. This makes it easier to adjust the error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real error handling setup for these scenarios. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: Include the device in most error printk()s  (Simon Kirby)
With more than one btrfs volume mounted, it can be very difficult to find out which volume is hitting an error. btrfs_error() will print this, but it is currently rigged as more of a fatal error handler, while many of the printk()s are currently for debugging and yet-unhandled cases. This patch just changes the functions where the device information is already available. Some cases remain where the root or fs_info is not passed to the function emitting the error. This may introduce some confusion with volumes backed by multiple devices emitting errors referring to the primary device in the set instead of the one on which the error occurred. Use btrfs_printk(fs_info, format, ...) rather than writing the device string every time, and introduce macro wrappers ala XFS for brevity. Since the function already cannot be used for continuations, print a newline as part of the btrfs_printk() message rather than at each caller. Signed-off-by: Simon Kirby <sim@hostway.ca> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  btrfs: clean snapshots one by one  (David Sterba)
Each time pick one dead root from the list and let the caller know if it's needed to continue. This should improve responsiveness during umount and balance which at some point waits for cleaning all currently queued dead roots. A new dead root is added to the end of the list, so the snapshots disappear in the order of deletion. The snapshot cleaning work is now done only from the cleaner thread and the others wake it if needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06  Btrfs: add an incompatible format change for smaller metadata extent refs  (Josef Bacik)
We currently store the first key of the tree block inside the reference for the tree block in the extent tree. This takes up quite a bit of space. Make a new key type for metadata which holds the level as the offset and completely removes storing the btrfs_tree_block_info inside the extent ref. This reduces the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster. This is not an automatic format change, you must enable it at mkfs time or with btrfstune. This patch deals with having metadata stored as either the old format or the new format so it is easy to convert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
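The size arithmetic behind the 51 -> 33 byte figure quoted above; the struct definitions are reconstructed from btrfs' on-disk format and packed the way the kernel packs them (an assumption worth double-checking against the headers):

    #include <stdint.h>
    #include <stdio.h>

    #pragma pack(push, 1)
    struct disk_key        { uint64_t objectid; uint8_t type; uint64_t offset; };
    struct extent_item     { uint64_t refs; uint64_t generation; uint64_t flags; };
    struct tree_block_info { struct disk_key key; uint8_t level; };
    struct inline_ref      { uint8_t type; uint64_t offset; };
    #pragma pack(pop)

    int main(void)
    {
        size_t old_ref = sizeof(struct extent_item) +
                         sizeof(struct tree_block_info) +
                         sizeof(struct inline_ref);        /* 24 + 18 + 9 */
        size_t skinny  = sizeof(struct extent_item) +
                         sizeof(struct inline_ref);        /* 24 + 9      */

        /* skinny keys store the level in the key offset instead of carrying
         * a btrfs_tree_block_info in the item body */
        printf("old=%zu skinny=%zu\n", old_ref, skinny);   /* 51 and 33   */
        return 0;
    }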
2013-03-29  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs  (Linus Torvalds)
Pull btrfs fixes from Chris Mason: "We've had a busy two weeks of bug fixing. The biggest patches in here are some long standing early-enospc problems (Josef) and a very old race where compression and mmap combine forces to lose writes (me). I'm fairly sure the mmap bug goes all the way back to the introduction of the compression code, which is proof that fsx doesn't trigger every possible mmap corner after all. I'm sure you'll notice one of these is from this morning, it's a small and isolated use-after-free fix in our scrub error reporting. I double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: don't drop path when printing out tree errors in scrub
  Btrfs: fix wrong return value of btrfs_lookup_csum()
  Btrfs: fix wrong reservation of csums
  Btrfs: fix double free in the btrfs_qgroup_account_ref()
  Btrfs: limit the global reserve to 512mb
  Btrfs: hold the ordered operations mutex when waiting on ordered extents
  Btrfs: fix space accounting for unlink and rename
  Btrfs: fix space leak when we fail to reserve metadata space
  Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
  Btrfs: fix race between mmap writes and compression
  Btrfs: fix memory leak in btrfs_create_tree()
  Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
  Btrfs: fix missing qgroup reservation before fallocating
  Btrfs: handle a bogus chunk tree nicely
  Btrfs: update to use fs_state bit
2013-03-28  Btrfs: limit the global reserve to 512mb  (Josef Bacik)
A user reported a problem where he was getting early ENOSPC with hundreds of gigs of free data space and 6 gigs of free metadata space. This is because the global block reserve was taking up the entire free metadata space. This is ridiculous, we have infrastructure in place to throttle if we start using too much of the global reserve, so instead of letting it get this huge just limit it to 512mb so that users can still get work done. This allowed the user to complete his rsync without issues. Thanks Cc: stable@vger.kernel.org Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
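The cap itself is essentially a one-line clamp; a sketch with illustrative names:

    /* Clamp the computed global reserve size to 512 MiB. */
    #include <stdio.h>

    #define SZ_512M (512ULL * 1024 * 1024)

    static unsigned long long clamp_global_rsv(unsigned long long computed)
    {
        return computed < SZ_512M ? computed : SZ_512M;
    }

    int main(void)
    {
        /* e.g. a 6 GiB computed reserve gets clamped down to 512 MiB */
        printf("%llu\n", clamp_global_rsv(6ULL * 1024 * 1024 * 1024));
        return 0;
    }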