summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2008-09-25Btrfs: implement memory reclaim for leaf reference cacheYan
The memory reclaiming issue happens when snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After a old snapshot was completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix verify_parent_transidChris Mason
It was incorrectly clearing the up to date flag on the buffer even when the buffer properly verified. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Update and fix mount -o nodatacowYan Zheng
To check whether a given file extent is referenced by multiple snapshots, the checker walks down the fs tree through dead root and checks all tree blocks in the path. We can easily detect whether a given tree block is directly referenced by other snapshot. We can also detect any indirect reference from other snapshot by checking reference's generation. The checker can always detect multiple references, but can't reliably detect cases of single reference. So btrfs may do file data cow even there is only one reference. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: async-thread: fix possible memory leakLi Zefan
When kthread_run() returns failure, this worker hasn't been added to the list, so btrfs_stop_workers() won't free it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Throttle operations if the reference cache gets too largeChris Mason
A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix version.sh when used outside of an hg repoChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Leaf reference cache updateChris Mason
This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a leaf reference cacheYan Zheng
Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Rev the disk format magicChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Null terminate strings passed in from userspaceMark Fasheh
The 'char name[BTRFS_PATH_NAME_MAX]' member of struct btrfs_ioctl_vol_args is passed directly to strlen() after being copied from user. I haven't verified this, but in theory a userspace program could pass in an unterminated string and cause a kernel crash as strlen walks off the end of the array. This patch terminates the ->name string in all btrfs ioctl functions which currently use a 'struct btrfs_ioctl_vol_args'. Since the string is now properly terminated, it's length will never be longer than BTRFS_PATH_NAME_MAX so that error check has been removed. By the way, it might be better overall to just have the ioctl pass an unterminated string + length structure but I didn't bother with that since it'd change the kernel/user interface. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix path slots selection in btrfs_search_forwardYan
We should decrease the found slot by one as btrfs_search_slot does when bin_search return 1 and node level > 0. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix .. lookup corner caseYan
Inode ref item can be in the next leaf when we find "path->slots[0] == btrfs_header_nritems(...)". Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Properly release lock in pin_down_bytesYan
When buffer isn't uptodate, pin_down_bytes may leave the tree locked after it returns. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Remove unused variable in fixup_tree_root_locationBalaji Rao
Remove a unused variable 'path' in fixup_tree_root_location. Signed-off-by: Balaji Rao <balajirrao@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix a few functions that exit without stopping their transactionJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Create orphan inode records to prevent lost files after a crashJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add ACL supportJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Remove unused xattr codeJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Implement new dir index formatJosef Bacik
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix the defragmention code and the block relocation code for data=orderedChris Mason
Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use assert_spin_locked instead of spin_trylockDavid Woodhouse
On UP systems spin_trylock always succeeds Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add version strings on module loadChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix some build problems on 2.6.18 based enterprise kernelsChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Search data ordered extents first for checksums on readChris Mason
Checksum items are not inserted into the tree until all of the io from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted. The checksums themselves are stored in the ordered extent so they can be inserted in bulk when IO is complete. On read, if a checksum item isn't found, the ordered extents were being searched for a checksum record. This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors. This means the read code might find a checksum item that hasn't yet really been filled in. This commit changes things to check the ordered extents first and only dive into the btree if nothing was found. This removes the need for extra locking and is more reliable. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ↵Chris Mason
ordered extent The ordered extents have to fit in memory, so an unsigned long is sufficient. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Take the csum mutex while reading checksumsChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: alloc_mutex latency reductionChris Mason
This releases the alloc_mutex in a few places that hold it for over long operations. btrfs_lookup_block_group is changed so that it doesn't need the mutex at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add some conditional schedules near the alloc_mutexChris Mason
This helps prevent stalls, especially while the snapshot cleaner is running hard Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use mutex_lock_nested for tree lockingChris Mason
Lockdep has the notion of locking subclasses so that you can identify locks you expect to be taken after other locks of the same class. This changes the per-extent buffer btree locking routines to use a subclass based on the level in the tree. Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs max level is also 8. So when lockdep is on, use a lower max level. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix some data=ordered related data corruptionsChris Mason
Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use a mutex in the extent buffer for tree block lockingChris Mason
This replaces the use of the page cache lock bit for locking, which wasn't suitable for block size < page size and couldn't be used recursively. The mutexes alone don't fix either problem, but they are the first step. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Index extent buffers in an rbtreeChris Mason
Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them. But, a few extra fields have crept in, and they are also the best place to store a per-tree block lock field as well. This commit puts the extent buffers into an rbtree, and ensures a single extent buffer for each tree block. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Data ordered fixesChris Mason
* In btrfs_delete_inode, wait for ordered extents after calling truncate_inode_pages. This is much faster, and more correct * Properly clear our the PageChecked bit everywhere we redirty the page. * Change the writepage fixup handler to lock the page range and check to see if an ordered extent had been inserted since the improperly dirtied page was discovered * Wait for ordered extents outside the transaction. This isn't required for locking rules but does improve transaction latencies * Reduce contention on the alloc_mutex by dropping it while incrementing refs on a node/leaf and while dropping refs on a leaf. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_wait_ordered_extent_range to properly waitChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Keep extent mappings in ram until pending ordered extents are doneChris Mason
It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is setChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Handle data checksumming on bios that span multiple ordered extentsChris Mason
Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Cleanup and comment ordered-data.cChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Force caching of metadata block groups on mount to avoid deadlockChris Mason
This is a temporary change to avoid deadlocks until the extent tree locking is fixed up. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_next_leaf: do readahead when skip_locking is turned onChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add a per-inode lock around btrfs_drop_extentsChris Mason
btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Don't pin pages in ram until the entire ordered extent is on disk.Chris Mason
Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk. In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page. This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_start_transaction: wait for commits in progress to finishChris Mason
btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Update on disk i_size only after pending ordered extents are doneChris Mason
This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use async helpers to deal with pages that have been improperly dirtiedChris Mason
Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: New data=ordered implementationChris Mason
The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Drop some verbose printksChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add locking around volume management (device add/remove/balance)Chris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix deadlock while searching for dead roots on mountChris Mason
btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which means we end up calling btrfs_search_slot with a path already held. The fix is to remember the key inside btrfs_find_dead_roots and drop the path. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Reduce contention on the root nodeChris Mason
This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a big against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>