summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_map.h
AgeCommit message (Collapse)Author
2013-05-06Btrfs: fix bad extent loggingJosef Bacik
A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06Btrfs: log ram bytes properlyJosef Bacik
When logging changed extents I was logging ram_bytes as the current length, which isn't correct, it's supposed to be the ram bytes of the original extent. This is for compression where even if we split the extent we need to know the ram bytes so when we uncompress the extent we know how big it will be. This was still working out right with compression for some reason but I think we were getting lucky. It was definitely off for prealloc which is why I noticed it, btrfsck was complaining about it. With this patch btrfsck no longer complains after a log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24Btrfs: do not allow logged extents to be merged or removedJosef Bacik
We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-12-17Btrfs: do not mark ems as prealloc if we are writing to themJosef Bacik
We are going to use EM's to log extents in the future, so we need to not mark them as prealloc if they aren't actually prealloc extents. Instead mark them with FILLING so we know to ammend mod_start/mod_len and that way we don't confuse the extent logging code. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-17Btrfs: keep track of the extents original block lengthJosef Bacik
If we've written to a prealloc extent we need to know the original block len for the extent. We can't figure this out currently since ->block_len is just set to the extent length. So introduce ->orig_block_len so that we know how many bytes were in the original extent for proper extent logging that future patches will need. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-04Btrfs: do not hold the write_lock on the extent tree while loggingJosef Bacik
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This is because I'm an idiot and didn't realize that rwlock's were spin locks, so we've been holding this thing while doing allocations and such which is not good. This patch fixes this by dropping the write lock before we do anything heavy and re-acquire it when it is done. We also need to take a ref on the em's in case their corresponding pages are evicted and mark them as being logged so that releasepage does not remove them and doesn't remove them from our local list. Thanks, Reported-by: Dave Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01Btrfs: improve fsync by filtering extents that we wantLiu Bo
This is based on Josef's "Btrfs: turbo charge fsync". The above Josef's patch performs very good in random sync write test, because we won't have too much extents to merge. However, it does not performs good on the test: dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync The reason is when we do sequencial sync write, we need to merge the current extent just with the previous one, so that we can get accumulated extents to log: A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ... So we'll have to flush more and more checksum into log tree, which is the bottleneck according to my tests. But we can avoid this by telling fsync the real extents that are needed to be logged. With this, I did the above dd sync write test (size=50m), w/o (orig) w/ (josef's) w/ (this) SATA 104KB/s 109KB/s 121KB/s ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01Btrfs: turbo charge fsyncJosef Bacik
At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-02-15btrfs: fix structs where bitfields and spinlock/atomic share 8B wordDavid Sterba
On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has disaterous effect. https://lkml.org/lkml/2012/2/1/220 Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02btrfs: drop gfp parameter from alloc_extent_mapDavid Sterba
pass GFP_NOFS directly to kmem_cache_alloc Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02btrfs: drop unused parameter from extent_map_tree_initDavid Sterba
the GFP flags are not stored anywhere and all allocations are done via alloc_extent_map(GFP_NOFS). Signed-off-by: David Sterba <dsterba@suse.cz>
2010-12-22btrfs: Allow to add new compression algorithmLi Zefan
Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2009-09-18Btrfs: search for an allocation hint while filling file COWChris Mason
The allocator has some nice knobs for sending hints about where to try and allocate new blocks, but when we're doing file allocations we're not sending any hint at all. This commit adds a simple extent map search to see if we can quickly and easily find a hint for the allocator. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: Fix extent replacment raceChris Mason
Data COW means that whenever we write to a file, we replace any old extent pointers with new ones. There was a window where a readpage might find the old extent pointers on disk and cache them in the extent_map tree in ram in the middle of a given write replacing them. Even though both the readpage and the write had their respective bytes in the file locked, the extent readpage inserts may cover more bytes than it had locked down. This commit closes the race by keeping the new extent pinned in the extent map tree until after the on-disk btree is properly setup with the new extent pointers. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: switch extent_map to a rw lockChris Mason
There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10Btrfs: Fix csum error for compressed dataYan Zheng
The decompress code doesn't take the logical offset in extent pointer into account. If the logical offset isn't zero, data will be decompressed into wrong pages. The solution used here is to record the starting offset of the extent in the file separately from the logical start of the extent_map struct. This allows us to avoid problems inserting overlapping extents. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30Btrfs: Add fallocate support v2Yan Zheng
This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30Btrfs: update hole handling v2Yan Zheng
This patch splits the hole insertion code out of btrfs_setattr into btrfs_cont_expand and updates btrfs_get_extent to properly handle the case that file extent items are not continuous. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29Btrfs: Add zlib compression supportChris Mason
This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix some data=ordered related data corruptionsChris Mason
Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Keep extent mappings in ram until pending ordered extents are doneChris Mason
It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Split the extent_map code into two partsChris Mason
There is now extent_map for mapping offsets in the file to disk and extent_io for state tracking, IO submission and extent_bufers. The new extent_map code shifts from [start,end] pairs to [start,len], and pushes the locking out into the caller. This allows a few performance optimizations and is easier to use. A number of extent_map usage bugs were fixed, mostly with failing to remove extent_map entries when changing the file. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Implement basic support for -ENOSPCChris Mason
This is intended to prevent accidentally filling the drive. A determined user can still make things oops. It includes some accounting of the current bytes under delayed allocation, but this will change as things get optimized Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: section mismatch warningsChristian Hesse
--Boundary-00=_CcOWHFYK4T+JwSj Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Hello everybody, compiling btrfs into the kernel results in section mismatch warnings. __exit functions are called where they are not allowed to. The attached patch fixes this for me. Not sure if it is correct though. Signed-off-by: Christian Hesse <mail@earthworm.de> -- Regards, Chris --Boundary-00=_CcOWHFYK4T+JwSj Content-Type: text/x-diff; charset="iso-8859-1"; name="btrfs-section_mismatches.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="btrfs-section_mismatches.patch" Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add efficient dirty accounting to the extent_map treeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Limit btree writeback to prevent seeksChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Return value checking in module initWyatt Banks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add readpages supportChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add writepages supportChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix a number of inline extent problems that Yan Zheng reported.Chris Mason
The fixes do a number of things: 1) Most btrfs_drop_extent callers will try to leave the inline extents in place. It can truncate bytes off the beginning of the inline extent if required. 2) writepage can now update the inline extent, allowing mmap writes to go directly into the inline extent. 3) btrfs_truncate_in_transaction truncates inline extents 4) extent_map.c fixed to not merge inline extent mappings and hole mappings together Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add back metadata checksummingChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: extent_map optimizations to cut down on CPU usageChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add an extent buffer LRU to reduce radix tree hitsChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add back the online defragging codeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use an array of pages in the extent buffers to reduce the cost of ↵Chris Mason
find_get_page Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Allow tree blocks larger than the page sizeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map ↵Chris Mason
trees Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Stop using radix trees for the block group cacheChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix extent_buffer and extent_state leaksChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Avoid memcpy where possible in extent_buffersChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Optimizations for the extent_buffer codeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Create extent_buffer interface for large blocksizesChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: factor page private preparations into a helperChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-11Btrfs: [PATCH] extent_map: add writepage_end_io hookChristoph Hellwig
XFS updates the ondisk inode size only after the data I/O has finished, so it needs a hook when the writepage end_bio handler has finished. Might not be worth applying as-is as the per-page callback is very ineffcient. What XFS really wants is a callback when writeout of a whole extent has completed. This delayed i_size updates scheme might be worthwile for btrfs aswell, btw. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-11Btrfs: [PATCH] extent_map: provide generic bmapChristoph Hellwig
generic_bmap is completely trivial, while the extent to bh mapping in btrfs is rather complex. So provide a extent_bmap instead that takes a get_extent callback and can be used by filesystem using the extent_map code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-30Btrfs: Add file data csums back in via hooks in the extent map codeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27Btrfs: Add delayed allocation to the extent based page tree codeChris Mason
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27Btrfs: Extent based page cache code. This uses an rbtree of extents and testsChris Mason
instead of buffer heads. Signed-off-by: Chris Mason <chris.mason@oracle.com>