summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-10-05xfs: increase log reservations for reflinkDarrick J. Wong
Increase the log reservations to handle the increased rolling that happens at the end of copy-on-write operations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: garbage collect old cowextsz reservationsDarrick J. Wong
Trim CoW reservations made on behalf of a cowextsz hint if they get too old or we run low on quota, so long as we don't have dirty data awaiting writeback or directio operations in progress. Garbage collection of the cowextsize extents are kept separate from prealloc extent reaping because setting the CoW prealloc lifetime to a (much) higher value than the regular prealloc extent lifetime has been useful for combatting CoW fragmentation on VM hosts where the VMs experience bursty write behaviors and we can keep the utilization ratios low enough that we don't start to run out of space. IOWs, it benefits us to keep the CoW fork reservations around for as long as we can unless we run out of blocks or hit inode reclaim. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: try other AGs to allocate a BMBT blockDarrick J. Wong
Prior to the introduction of reflink, allocating a block and mapping it into a file was performed in a single transaction with a single block reservation, and the allocator was supposed to find enough blocks to allocate the extent and any BMBT blocks that might be necessary (unless we're low on space). However, due to the way copy on write works, allocation and mapping have been split into two transactions, which means that we must be able to handle the case where we allocate an extent for CoW but that AG runs out of free space before the blocks can be mapped into a file, and the mapping requires a new BMBT block. When this happens, look in one of the other AGs for a BMBT block instead of taking the FS down. The same applies to the functions that convert a data fork to extents and later btree format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: don't allow reflink when the AG is low on spaceDarrick J. Wong
If the AG free space is down to the reserves, refuse to reflink our way out of space. Hopefully userspace will make a real copy and/or go elsewhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: preallocate blocks for worst-case btree expansionDarrick J. Wong
To gracefully handle the situation where a CoW operation turns a single refcount extent into a lot of tiny ones and then run out of space when a tree split has to happen, use the per-AG reserved block pool to pre-allocate all the space we'll ever need for a maximal btree. For a 4K block size, this only costs an overhead of 0.3% of available disk space. When reflink is enabled, we have an unfortunate problem with rmap -- since we can share a block billions of times, this means that the reverse mapping btree can expand basically infinitely. When an AG is so full that there are no free blocks with which to expand the rmapbt, the filesystem will shut down hard. This is rather annoying to the user, so use the AG reservation code to reserve a "reasonable" amount of space for rmap. We'll prevent reflinks and CoW operations if we think we're getting close to exhausting an AG's free space rather than shutting down, but this permanent reservation should be enough for "most" users. Hopefully. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch@lst.de: ensure that we invalidate the freed btree buffer] Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: create a separate cow extent size hint for the allocatorDarrick J. Wong
Create a per-inode extent size allocator hint for copy-on-write. This hint is separate from the existing extent size hint so that CoW can take advantage of the fragmentation-reducing properties of extent size hints without disabling delalloc for regular writes. The extent size hint that's fed to the allocator during a copy on write operation is the greater of the cowextsize and regular extsize hint. During reflink, if we're sharing the entire source file to the entire destination file and the destination file doesn't already have a cowextsize hint, propagate the source file's cowextsize hint to the destination file. Furthermore, zero the bulkstat buffer prior to setting the fields so that we don't copy kernel memory contents into userspace. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: unshare a range of blocks via fallocateDarrick J. Wong
Unshare all shared extents if the user calls fallocate with the new unshare mode flag set, so that we can guarantee that a subsequent write will not ENOSPC. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: pass inode instead of file to xfs_reflink_dirty_range, use iomap infrastructure for copy up] Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: swap inode reflink flags when swapping inode extentsDarrick J. Wong
When we're swapping the extents of two inodes, be sure to swap the reflink inode flag too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: teach get_bmapx about shared extents and the CoW forkDarrick J. Wong
Teach xfs_getbmapx how to report shared extents and CoW fork contents accurately in the bmap output by querying the refcount btree appropriately. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: add dedupe range vfs functionDarrick J. Wong
Define a VFS function which allows userspace to request that the kernel reflink a range of blocks between two files if the ranges' contents match. The function fits the new VFS ioctl that standardizes the checking for the btrfs EXTENT SAME ioctl. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: add clone file and clone range vfs functionsDarrick J. Wong
Define two VFS functions which allow userspace to reflink a range of blocks between two files or to reflink one file's contents to another. These functions fit the new VFS ioctls that standardize the checking for the btrfs CLONE and CLONE RANGE ioctls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: reflink extents from one file to anotherDarrick J. Wong
Reflink extents from one file to another; that is to say, iteratively remove the mappings from the destination file, copy the mappings from the source file to the destination file, and increment the reference count of all the blocks that got remapped. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: store in-progress CoW allocations in the refcount btreeDarrick J. Wong
Due to the way the CoW algorithm in XFS works, there's an interval during which blocks allocated to handle a CoW can be lost -- if the FS goes down after the blocks are allocated but before the block remapping takes place. This is exacerbated by the cowextsz hint -- allocated reservations can sit around for a while, waiting to get used. Since the refcount btree doesn't normally store records with refcount of 1, we can use it to record these in-progress extents. In-progress blocks cannot be shared because they're not user-visible, so there shouldn't be any conflicts with other programs. This is a better solution than holding EFIs during writeback because (a) EFIs can't be relogged currently, (b) even if they could, EFIs are bound by available log space, which puts an unnecessary upper bound on how much CoW we can have in flight, and (c) we already have a mechanism to track blocks. At mount time, read the refcount records and free anything we find with a refcount of 1 because those were in-progress when the FS went down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: cancel pending CoW reservations when destroying inodesDarrick J. Wong
When destroying the inode, cancel all pending reservations in the CoW fork so that all the reserved blocks go back to the free pile. In theory this sort of cleanup is only needed to clean up after write errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: cancel CoW reservations and clear inode reflink flag when freeing blocksDarrick J. Wong
When we're freeing blocks (truncate, punch, etc.), clear all CoW reservations in the range being freed. If the file block count drops to zero, also clear the inode reflink flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: implement CoW for directio writesDarrick J. Wong
For O_DIRECT writes to shared blocks, we have to CoW them just like we would with buffered writes. For writes that are not block-aligned, just bounce them to the page cache. For block-aligned writes, however, we can do better than that. Use the same mechanisms that we employ for buffered CoW to set up a delalloc reservation, allocate all the blocks at once, issue the writes against the new blocks and use the same ioend functions to remap the blocks after the write. This should be fairly performant. Christoph discovered that xfs_reflink_allocate_cow_range may stumble over invalid entries in the extent array given that it drops the ilock but still expects the index to be stable. Simple fixing it to a new lookup for every iteration still isn't correct given that xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and there is nothing preventing a xfs_bunmapi_cow call removing extents once we dropped the ilock either. This patch duplicates the inner loop of xfs_bmapi_allocate into a helper for xfs_reflink_allocate_cow_range so that it can be done under the same ilock critical section as our CoW fork delayed allocation. The directio CoW warts will be revisited in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: report shared extent mappings to userspace correctlyDarrick J. Wong
Report shared extents through the iomap interface so that FIEMAP flags shared blocks accurately. Have xfs_vm_bmap return zero for reflinked files because the bmap-based swap code requires static block mappings, which is incompatible with copy on write. NOTE: Existing userspace bmap users such as lilo will have the same problem with reflink files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-05xfs: move mappings from cow fork to data fork after copy-writeDarrick J. Wong
After the write component of a copy-write operation finishes, clean up the bookkeeping left behind. On error, we simply free the new blocks and pass the error up. If we succeed, however, then we must remove the old data fork mapping and move the cow fork mapping to the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: Call the CoW failure function during xfs_cancel_ioend] Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: support removing extents from CoW forkDarrick J. Wong
Create a helper method to remove extents from the CoW fork without any of the side effects (rmapbt/bmbt updates) of the regular extent deletion routine. We'll eventually use this to clear out the CoW fork during ioend processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: allocate delayed extents in CoW forkDarrick J. Wong
Modify the writepage handler to find and convert pending delalloc extents to real allocations. Furthermore, when we're doing non-cow writes to a part of a file that already has a CoW reservation (the cowextsz hint that we set up in a subsequent patch facilitates this), promote the write to copy-on-write so that the entire extent can get written out as a single extent on disk, thereby reducing post-CoW fragmentation. Christoph moved the CoW support code in _map_blocks to a separate helper function, refactored other functions, and reduced the number of CoW fork lookups, so I merged those changes here to reduce churn. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: support allocating delayed extents in CoW forkDarrick J. Wong
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed allocation extents in the CoW fork to real allocations, and wire this up all the way back to xfs_iomap_write_allocate(). In a subsequent patch, we'll modify the writepage handler to call this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: create delalloc extents in CoW forkDarrick J. Wong
Wire up iomap_begin to detect shared extents and create delayed allocation extents in the CoW fork: 1) Check if we already have an extent in the COW fork for the area. If so nothing to do, we can move along. 2) Look up block number for the current extent, and if there is none it's not shared move along. 3) Unshare the current extent as far as we are going to write into it. For this we avoid an additional COW fork lookup and use the information we set aside in step 1) above. 4) Goto 1) unless we've covered the whole range. Last but not least, this updates the xfs_reflink_reserve_cow_range calling convention to pass a byte offset and length, as that is what both callers expect anyway. This patch has been refactored considerably as part of the iomap transition. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: support bmapping delalloc extents in the CoW forkDarrick J. Wong
Allow the creation of delayed allocation extents in the CoW fork. In a subsequent patch we'll wire up iomap_begin to actually do this via reflink helper functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: introduce the CoW forkDarrick J. Wong
Introduce a new in-core fork for storing copy-on-write delalloc reservations and allocated extents that are in the process of being written out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: don't allow reflinked dir/dev/fifo/socket/pipe filesDarrick J. Wong
Only non-rt files can be reflinked, so check that when we load an inode. Also, don't leak the attr fork if there's a failure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: add reflink feature flag to geometryDarrick J. Wong
Report the reflink feature in the XFS geometry so that xfs_info and friends know the filesystem has this feature. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: define tracepoints for reflink activitiesDarrick J. Wong
Define all the tracepoints we need to inspect the runtime operation of reflink/dedupe/copy-on-write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: return work remaining at the end of a bunmapi operationDarrick J. Wong
Return the range of file blocks that bunmapi didn't free. This hint is used by CoW and reflink to figure out what part of an extent actually got freed so that it can set up the appropriate atomic remapping of just the freed range. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: when replaying bmap operations, don't let unlinked inodes get reapedDarrick J. Wong
Log recovery will iget an inode to replay BUI items and iput the inode when it's done. Unfortunately, if the inode was unlinked, the iput will see that i_nlink == 0 and decide to truncate & free the inode, which prevents us from replaying subsequent BUIs. We can't skip the BUIs because we have to replay all the redo items to ensure that atomic operations complete. Since unlinked inode recovery will reap the inode anyway, we can safely introduce a new inode flag to indicate that an inode is in this 'unlinked recovery' state and should not be auto-reaped in the drop_inode path. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: implement deferred bmbt map/unmap operationsDarrick J. Wong
Implement deferred versions of the inode block map/unmap functions. These will be used in subsequent patches to make reflink operations atomic. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: pass bmapi flags through to bmap_del_extentDarrick J. Wong
Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend BMAPI_REMAP (which means "don't touch the allocator or the quota accounting") to apply to bunmapi as well. This will be used to implement the unmap operation, which will be used by swapext. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: map an inode's offset to an exact physical blockDarrick J. Wong
Teach the bmap routine to know how to map a range of file blocks to a specific range of physical blocks, instead of simply allocating fresh blocks. This enables reflink to map a file to blocks that are already in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: log bmap intent itemsDarrick J. Wong
Provide a mechanism for higher levels to create BUI/BUD items, submit them to the log, and a stub function to deal with recovered BUI items. These parts will be connected to the rmapbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: create bmbt update intent log itemsDarrick J. Wong
Create bmbt update intent/done log items to record redo information in the log. Because we roll transactions multiple times for reflink operations, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: introduce reflink utility functionsDarrick J. Wong
These functions will be used by the other reflink functions to find the maximum length of a range of shared blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: reserve AG space for the refcount btree rootDarrick J. Wong
Reduce the max AG usable space size so that we always have space for the refcount btree root. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: add refcount btree block detection to log recoveryDarrick J. Wong
Identify refcountbt blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: adjust refcount when unmapping file blocksDarrick J. Wong
When we're unmapping blocks from a reflinked file, decrease the refcount of the affected blocks and free the extents that are no longer in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: connect refcount adjust functions to upper layersDarrick J. Wong
Plumb in the upper level interface to schedule and finish deferred refcount operations via the deferred ops mechanism. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: adjust refcount of an extent of blocks in refcount btreeDarrick J. Wong
Provide functions to adjust the reference counts for an extent of physical blocks stored in the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: log refcount intent itemsDarrick J. Wong
Provide a mechanism for higher levels to create CUI/CUD items, submit them to the log, and a stub function to deal with recovered CUI items. These parts will be connected to the refcountbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: create refcount update intent log itemsDarrick J. Wong
Create refcount update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: add refcount btree operationsDarrick J. Wong
Implement the generic btree operations required to manipulate refcount btree blocks. The implementation is similar to the bmapbt, though it will only allocate and free blocks from the AG. Since the refcount root and level fields are separate from the existing roots and levels array, they need a separate logging flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: fix logging of AGF refcount btree fields] Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: account for the refcount btree in the alloc/free log reservationDarrick J. Wong
Every time we allocate or free a data extent, we might need to split the refcount btree. Reserve some blocks in the transaction to handle this possibility. Even though the deferred refcount code can roll a transaction to avoid overloading the transaction, we can still exceed the reservation. Certain pathological workloads (1k blocks, no cowextsize hint, random directio writes), cause a perfect storm wherein a refcount adjustment of a large range of blocks causes full tree splits in two separate extents in two separate refcount tree blocks; allocating new refcount tree blocks causes rmap btree splits; and all the allocation activity causes the freespace btrees to split, blowing the reservation. (Reproduced by generic/167 over NFS atop XFS) Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick.wong@oracle.com: add commit message] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-03xfs: add refcount btree support to growfsDarrick J. Wong
Modify the growfs code to initialize new refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: define the on-disk refcount btree formatDarrick J. Wong
Start constructing the refcount btree implementation by establishing the on-disk format and everything needed to read, write, and manipulate the refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: refcount btree add more reserved blocksDarrick J. Wong
Since XFS reserves a small amount of space in each AG as the minimum free space needed for an operation, save some more space in case we touch the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: introduce refcount btree definitionsDarrick J. Wong
Add new per-AG refcount btree definitions to the per-AG structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: define tracepoints for refcount btree activitiesDarrick J. Wong
Define all the tracepoints we need to inspect the refcount btree runtime operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: return an error when an inline directory is too smallDarrick J. Wong
If the size of an inline directory is so small that it doesn't even cover the required header size, return an error to userspace instead of ASSERTing and returning 0 like everything's ok. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>