summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-08-03xfs: rework xfs_bmap_free callers to use xfs_defer_opsDarrick J. Wong
Restructure everything that used xfs_bmap_free to use xfs_defer_ops instead. For now we'll just remove the old symbols and play some cpp magic to make it work; in the next patch we'll actually rename everything. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: enable the xfs_defer mechanism to process extents to freeDarrick J. Wong
Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred extent freeing. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: clean up typedef usage in the EFI/EFD handling codeDarrick J. Wong
Replace structure typedefs with struct xfs_foo_* in the EFI/EFD handling code in preparation to move it over to deferred ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add tracepoints for the deferred ops mechanismDarrick J. Wong
Add tracepoints for the internals of the deferred ops mechanism and tracepoint classes for clients of the dops, to make debugging easier. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: move deferred operations into a separate fileDarrick J. Wong
All the code around struct xfs_bmap_free basically implements a deferred operation framework through which we can roll transactions (to unlock buffers and avoid violating lock order rules) while managing all the necessary log redo items. Previously we only used this code to free extents after some sort of mapping operation, but with the advent of rmap and reflink, we suddenly need to do more than that. With that in mind, xfs_bmap_free really becomes a deferred ops control structure. Rename the structure and move the deferred ops into their own file to avoid further bloating of the bmap code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: refactor btree owner change into a separate visit-blocks functionDarrick J. Wong
Refactor the btree_change_owner function into a more generic apparatus which visits all blocks in a btree. We'll use this in a subsequent patch for counting btree blocks for AG reservations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: introduce interval queries on btreesDarrick J. Wong
Create a function to enable querying of btree records mapping to a range of keys. This will be used in subsequent patches to allow querying the reverse mapping btree to find the extents mapped to a range of physical blocks, though the generic code can be used for any range query. The overlapped query range function needs to use the btree get_block helper because the root block could be an inode, in which case bc_bufs[nlevels-1] will be NULL. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: support btrees with overlapping intervals for keysDarrick J. Wong
On a filesystem with both reflink and reverse mapping enabled, it's possible to have multiple rmap records referring to the same blocks on disk. When overlapping intervals are possible, querying a classic btree to find all records intersecting a given interval is inefficient because we cannot use the left side of the search interval to filter out non-matching records the same way that we can use the existing btree key to filter out records coming after the right side of the search interval. This will become important once we want to use the rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl. (For the non-overlapping case, we can perform such queries trivially by starting at the left side of the interval and walking the tree until we pass the right side.) Therefore, extend the btree code to come closer to supporting intervals as a first-class record attribute. This involves widening the btree node's key space to store both the lowest key reachable via the node pointer (as the btree does now) and the highest key reachable via the same pointer and teaching the btree modifying functions to keep the highest-key records up to date. This behavior can be turned on via a new btree ops flag so that btrees that cannot store overlapping intervals don't pay the overhead costs in terms of extra code and disk format changes. When we're deleting a record in a btree that supports overlapped interval records and the deletion results in two btree blocks being joined, we defer updating the high/low keys until after all possible joining (at higher levels in the tree) have finished. At this point, the btree pointers at all levels have been updated to remove the empty blocks and we can update the low and high keys. When we're doing this, we must be careful to update the keys of all node pointers up to the root instead of stopping at the first set of keys that don't need updating. This is because it's possible for a single deletion to cause joining of multiple levels of tree, and so we need to update everything going back to the root. The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than, equal to, or greater than key2, respectively. This is consistent with the rest of the kernel and the C library. In btree_updkeys(), we need to evaluate the force_all parameter before running the key diff to avoid reading uninitialized memory when we're forcing a key update. This happens when we've allocated an empty slot at level N + 1 to point to a new block at level N and we're in the process of filling out the new keys. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add function pointers for get/update keys to the btreeDarrick J. Wong
Add some function pointers to bc_ops to get the btree keys for leaf and node blocks, and to update parent keys of a block. Convert the _btree_updkey calls to use our new pointer, and modify the tree shape changing code to call the appropriate get_*_keys pointer instead of _btree_copy_keys because the overlapping btree has to calculate high key values. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: during btree split, save new block key & ptr for future insertionDarrick J. Wong
When a btree block has to be split, we pass the new block's ptr from xfs_btree_split() back to xfs_btree_insert() via a pointer parameter; however, we pass the block's key through the cursor's record. It is a little weird to "initialize" a record from a key since the non-key attributes will have garbage values. When we go to add support for interval queries, we have to be able to pass the lowest and highest keys accessible via a pointer. There's no clean way to pass this back through the cursor's record field. Therefore, pass the key directly back to xfs_btree_insert() the same way that we pass the btree_ptr. As a bonus, we no longer need init_rec_from_key and can drop it from the codebase. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: set *stat=1 after iroot reallocDarrick J. Wong
If we make the inode root block of a btree unfull by expanding the root, we must set *stat to 1 to signal success, rather than leaving it uninitialized. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: fix locking of the rt bitmap/summary inodesDarrick J. Wong
When we're deleting realtime extents, we need to lock the summary inode in case we need to update the summary info to prevent an assert on the rsumip inode lock on a debug kernel. While we're at it, fix the locking annotations so that we avoid triggering lockdep warnings. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: fix attr shortform structure alignment on crisDarrick J. Wong
Apparently cris doesn't require structure stride to align with the largest type in the struct, so list[0] isn't at offset 4 like it is everywhere else. Fix this... insofar as existing XFSes on cris are screwed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: in _attrlist_by_handle, copy the cursor back to userspaceDarrick J. Wong
When we're iterating inode xattrs by handle, we have to copy the cursor back to userspace so that a subsequent invocation actually retrieves subsequent contents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22Merge branch 'xfs-4.8-misc-fixes-4' into for-nextDave Chinner
2016-07-22xfs: remove EXPERIMENTAL tag from sparse inode featureDave Chinner
Been around for long enough now, hasn't caused any regression test failures in the past 3 months, so it's time to make it a fully supported feature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21xfs: bufferhead chains are invalid after end_page_writebackDave Chinner
In xfs_finish_page_writeback(), we have a loop that looks like this: do { if (off < bvec->bv_offset) goto next_bh; if (off > end) break; bh->b_end_io(bh, !error); next_bh: off += bh->b_size; } while ((bh = bh->b_this_page) != head); The b_end_io function is end_buffer_async_write(), which will call end_page_writeback() once all the buffers have marked as no longer under IO. This issue here is that the only thing currently protecting both the bufferhead chain and the page from being reclaimed is the PageWriteback state held on the page. While we attempt to limit the loop to just the buffers covered by the IO, we still read from the buffer size and follow the next pointer in the bufferhead chain. There is no guarantee that either of these are valid after the PageWriteback flag has been cleared. Hence, loops like this are completely unsafe, and result in use-after-free issues. One such problem was caught by Calvin Owens with KASAN: ..... INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1 free_buffer_head+0x41/0x90 __slab_free+0x1ed/0x340 kmem_cache_free+0x270/0x300 free_buffer_head+0x41/0x90 try_to_free_buffers+0x171/0x240 xfs_vm_releasepage+0xcb/0x3b0 try_to_release_page+0x106/0x190 shrink_page_list+0x118e/0x1a10 shrink_inactive_list+0x42c/0xdf0 shrink_zone_memcg+0xa09/0xfa0 shrink_zone+0x2c3/0xbc0 ..... Call Trace: <IRQ> [<ffffffff81e8b8e4>] dump_stack+0x68/0x94 [<ffffffff8153a995>] print_trailer+0x115/0x1a0 [<ffffffff81541174>] object_err+0x34/0x40 [<ffffffff815436e7>] kasan_report_error+0x217/0x530 [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50 [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0 [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220 [<ffffffff81de0c58>] bio_endio+0x158/0x1b0 [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80 [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0 [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690 [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0 [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340 Where the access is occuring during IO completion after the buffer had been freed from direct memory reclaim. Prevent use-after-free accidents in this end_io processing loop by pre-calculating the loop conditionals before calling bh->b_end_io(). The loop is already limited to just the bufferheads covered by the IO in progress, so the offset checks are sufficient to prevent accessing buffers in the chain after end_page_writeback() has been called by the the bh->b_end_io() callout. Yet another example of why Bufferheads Must Die. cc: <stable@vger.kernel.org> # 4.7 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21xfs: allocate log vector buffers outside CIL context lockDave Chinner
One of the problems we currently have with delayed logging is that under serious memory pressure we can deadlock memory reclaim. THis occurs when memory reclaim (such as run by kswapd) is reclaiming XFS inodes and issues a log force to unpin inodes that are dirty in the CIL. The CIL is pushed, but this will only occur once it gets the CIL context lock to ensure that all committing transactions are complete and no new transactions start being committed to the CIL while the push switches to a new context. The deadlock occurs when the CIL context lock is held by a committing process that is doing memory allocation for log vector buffers, and that allocation is then blocked on memory reclaim making progress. Memory reclaim, however, is blocked waiting for a log force to make progress, and so we effectively deadlock at this point. To solve this problem, we have to move the CIL log vector buffer allocation outside of the context lock so that memory reclaim can always make progress when it needs to force the log. The problem with doing this is that a CIL push can take place while we are determining if we need to allocate a new log vector buffer for an item and hence the current log vector may go away without warning. That means we canot rely on the existing log vector being present when we finally grab the context lock and so we must have a replacement buffer ready to go at all times. To ensure this, introduce a "shadow log vector" buffer that is always guaranteed to be present when we gain the CIL context lock and format the item. This shadow buffer may or may not be used during the formatting, but if the log item does not have an existing log vector buffer or that buffer is too small for the new modifications, we swap it for the new shadow buffer and format the modifications into that new log vector buffer. The result of this is that for any object we modify more than once in a given CIL checkpoint, we double the memory required to track dirty regions in the log. For single modifications then we consume the shadow log vectorwe allocate on commit, and that gets consumed by the checkpoint. However, if we make multiple modifications, then the second transaction commit will allocate a shadow log vector and hence we will end up with double the memory usage as only one of the log vectors is consumed by the CIL checkpoint. The remaining shadow vector will be freed when th elog item is freed. This can probably be optimised in future - access to the shadow log vector is serialised by the object lock (as opposited to the active log vector, which is controlled by the CIL context lock) and so we can probably free shadow log vector from some objects when the log item is marked clean on removal from the AIL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21libxfs: directory node splitting does not have an extra blockDave Chinner
xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8 xfs_da3_split() has to handle all three versions of the directory/attribute btree structure. The attr tree is v1, the dir tre is v2 or v3. The main difference between the v1 and v2/3 trees is the way tree nodes are split - in the v1 tree we can require a double split to occur because the object to be inserted may be larger than the space made by splitting a leaf. In this case we need to do a double split - one to split the full leaf, then another to allocate an empty leaf block in the correct location for the new entry. This does not happen with dir (v2/v3) formats as the objects being inserted are always guaranteed to fit into the new space in the split blocks. Indeed, for directories they *may* be an extra block on this buffer pointer. However, it's guaranteed not to be a leaf block (i.e. a directory data block) - the directory code only ever places hash index or free space blocks in this pointer (as a cursor of sorts), and so to use it as a directory data block will immediately corrupt the directory. The problem is that the code assumes that there may be extra blocks that we need to link into the tree once we've split the root, but this is not true for either dir or attr trees, because the extra attr block is always consumed by the last node split before we split the root. Hence the linking in an extra block is always wrong at the root split level, and this manifests itself in repair as a directory corruption in a repaired directory, leaving the directory rebuild incomplete. This is a dir v2 zero-day bug - it was in the initial dir v2 commit that was made back in February 1998. Fix this by ensuring the linking of the blocks after the root split never tries to make use of the extra blocks that may be held in the cursor. They are held there for other purposes and should never be touched by the root splitting code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21xfs: remove dax code from object file when disabledArnd Bergmann
We check IS_DAX(inode) before calling either xfs_file_dax_read or xfs_file_dax_write, and this will lead the call being optimized out at compile time when CONFIG_FS_DAX is disabled. However, the two functions are marked STATIC, so they become global symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global functions that call into an undefined function and a broken "allmodconfig" build: fs/built-in.o: In function `xfs_file_dax_read': fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io' fs/built-in.o: In function `xfs_file_dax_write': fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io' Marking the two functions 'static noinline' instead of 'STATIC' will let the compiler drop the symbols when there are no callers but avoid the implicit inlining. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 16d4d43595b4 ("xfs: split direct I/O and DAX path") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-21xfs: skip dirty pages in ->releasepage()Brian Foster
XFS has had scattered reports of delalloc blocks present at ->releasepage() time. This results in a warning with a stack trace similar to the following: ... Call Trace: [<ffffffffa23c5b8f>] dump_stack+0x63/0x84 [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0 [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140 [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0 [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150 [<ffffffffa21521c2>] try_to_release_page+0x32/0x50 [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0 [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0 [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0 [<ffffffffa2168539>] kswapd+0x4f9/0x970 [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0 [<ffffffffa20a0d99>] kthread+0xc9/0xe0 [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100 [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70 [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100 This occurs because it is possible for shrink_active_list() to send pages marked dirty to ->releasepage() when certain buffer_head threshold conditions are met. shrink_active_list() doesn't check the page dirty state apparently to handle an old ext3 corner case where in some cases clean pages would not have the dirty bit cleared, thus it is up to the filesystem to determine how to handle the page. XFS currently handles the delalloc case properly, but this behavior makes the warning spurious. Update the XFS ->releasepage() handler to explicitly skip dirty pages. Retain the existing delalloc/unwritten checks so we continue to warn if such buffers exist on clean pages when they shouldn't. Diagnosed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20Merge branch 'xfs-4.8-dir2-sf-fixes' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-split-dax-dio' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-buf-fixes' into for-nextDave Chinner
2016-07-20Merge branch 'xfs-4.8-misc-fixes-3' into for-nextDave Chinner
2016-07-20xfs: remove __arch_packChristoph Hellwig
Instead we always declare struct xfs_dir2_sf_hdr as packed. That's the expected layout, and while most major architectures do the packing by default the new structure size and offset checker showed that not only the ARM old ABI got this wrong, but various minor embedded architectures did as well. [Verified that no code change on x86-64 results from this change] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: kill xfs_dir2_inou_tChristoph Hellwig
And use an array of unsigned char values directly to avoid problems with architectures that pad the size of structures. This also gets rid of the xfs_dir2_ino4_t and xfs_dir2_ino8_t types, and introduces new constants for the size of 4 and 8 bytes as well as the size difference between the two. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: kill xfs_dir2_sf_off_tChristoph Hellwig
Just use an array of two unsigned chars directly to avoid problems with architectures that pad the size of structures. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: split direct I/O and DAX pathChristoph Hellwig
So far the DAX code overloaded the direct I/O code path. There is very little in common between the two, and untangling them allows to clean up both variants. As a side effect we also get separate trace points for both I/O types. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: direct calls in the direct I/O pathChristoph Hellwig
We control both the callers and callees of ->direct_IO, so remove the indirect calls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: stop using generic_file_read_iter for direct I/OChristoph Hellwig
XFS already implement it's own flushing of the pagecache because it implements proper synchronization for direct I/O reads. This means calling generic_file_read_iter for direct I/O is rather useless, as it doesn't do much but updating the atime and iocb position for us. This also gets rid of the buffered I/O fallback that isn't used for XFS. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: split xfs_file_read_iter into buffered and direct I/O helpersChristoph Hellwig
Similar to what we did on the write side a while ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: remove s_maxbytes enforcement in xfs_file_read_iterChristoph Hellwig
All the three low-level read implementations that we might call already take care of not overflowing the maximum supported bytes, no need to duplicate it here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: kill ioflagsChristoph Hellwig
Now that we have the direct I/O kiocb flag there is no real need to sample the value inside of XFS, and the invis flag was always just partially used and isn't worth keeping this infrastructure around for. This also splits the read tracepoint into buffered vs direct as we've done for writes a long time ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't pass ioflags around in the ioctl pathChristoph Hellwig
Instead check the file pointer for the invisble I/O flag directly, and use the chance to drop redundant arguments from the xfs_ioc_space prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: track and serialize in-flight async buffers against unmountBrian Foster
Newly allocated XFS metadata buffers are added to the LRU once the hold count is released, which typically occurs after I/O completion. There is no other mechanism at current that tracks the existence or I/O state of a new buffer. Further, readahead I/O tends to be submitted asynchronously by nature, which means the I/O can remain in flight and actually complete long after the calling context is gone. This means that file descriptors or any other holds on the filesystem can be released, allowing the filesystem to be unmounted while I/O is still in flight. When I/O completion occurs, core data structures may have been freed, causing completion to run into invalid memory accesses and likely to panic. This problem is reproduced on XFS via directory readahead. A filesystem is mounted, a directory is opened/closed and the filesystem immediately unmounted. The open/close cycle triggers a directory readahead that if delayed long enough, runs buffer I/O completion after the unmount has completed. To address this problem, add a mechanism to track all in-flight, asynchronous buffers using per-cpu counters in the buftarg. The buffer is accounted on the first I/O submission after the current reference is acquired and unaccounted once the buffer is returned to the LRU or freed. Update xfs_wait_buftarg() to wait on all in-flight I/O before walking the LRU list. Once in-flight I/O has completed and the workqueue has drained, all new buffers should have been released onto the LRU. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: exclude never-released buffers from buftarg I/O accountingBrian Foster
The upcoming buftarg I/O accounting mechanism maintains a count of all buffers that have undergone I/O in the current hold-release cycle. Certain buffers associated with core infrastructure (e.g., the xfs_mount superblock buffer, log buffers) are never released, however. This means that accounting I/O submission on such buffers elevates the buftarg count indefinitely and could lead to lockup on unmount. Define a new buffer flag to explicitly exclude buffers from buftarg I/O accounting. Set the flag on the superblock and associated log buffers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't reset b_retries to 0 on every failureEric Sandeen
With the code as it stands today, b_retries never increments because it gets reset to 0 in the error callback. Remove that, and fix a similar problem where the first retry time was constantly being overwritten, which defeated the timeout tunable as well. We now only set first retry time if a non-zero timeout is set, to match the behavior of only incrementing retries if a retry value is set. This way max retries & timeouts consistently take effect after a tunable is set, rather than acting retroactively on a buffer which has failed at some point in the past and has accumulated state from those prior failures. Thanks to dchinner for talking through this with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: remove extraneous buffer flag changesEric Sandeen
Fix up a couple places where extra flag manipulation occurs. In the first case we clear XBF_ASYNC and then immediately reset it - so don't bother clearing in the first place. In the 2nd case we are at a point in the function where the buffer must already be async, so there is no need to reset it. Add consistent spacing around the " | " while we're at it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: fix xfs_error_get_cfg for negative errnosEric Sandeen
xfs_error_get_cfg() is called with bp->b_error as an arg, which is negative, so the switch statement won't ever find any matches. This results in only the default error handler having any effect, as EIO/ENOSPC/ENODEV get ignored due to the wrong sign. It seems simplest to always flip the error sign to positive, so that we can handle either negative errors in bp->b_error, or possibly a positive errno via something like xfs_error_get_cfg(EIO) - this future-proofs the function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: remove the magic numbers in xfs_btree_block-related len macrosHou Tao
replace the magic numbers by offsetof(...) and sizeof(...), and add two extra checks on xfs_check_ondisk_structs() [dchinner: renamed header structures to be more descriptive] Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: indentation fix in xfs_btree_get_iroot()Kaho Ng
The indentation in this function is different from the other functions. Those spacebars are converted to tabs to improve readability. Signed-off-by: Kaho Ng <ngkaho1234@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: don't allow negative error tagsDan Carpenter
Errors go from zero which means no error to XFS_ERRTAG_MAX (22). My static checker complains that xfs_errortag_add() puts an upper bound on this but not a lower bound. Let's fix it by making it unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: fix type confusion in xfs_ioc_swapextJann Horn
When calling fdget() in xfs_ioc_swapext(), we need to verify that the file descriptors passed into the ioctl point to XFS inodes before we start operations on them. If we don't do this, we could be referencing arbitrary kernel memory as an XFS inode. THis could lead to memory corruption and/or performing locking operations on attacker-chosen structures in kernel memory. [dchinner: rewrite commit message ] [dchinner: add comment explaining new check ] Signed-off-by: Jann Horn <jann@thejh.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21Merge branch 'xfs-4.8-misc-fixes-2' into for-nextDave Chinner
2016-06-21xfs: refactor btree maxlevels computationDarrick J. Wong
Create a common function to calculate the maximum height of a per-AG btree. This will eventually be used by the rmapbt and refcountbt code to calculate appropriate maxlevels values for each. This is important because the verifiers and the transaction block reservations depend on accurate estimates of how many blocks are needed to satisfy a btree split. We were mistakenly using the max bnobt height for all the btrees, which creates a dangerous situation since the larger records and keys in an rmapbt make it very possible that the rmapbt will be taller than the bnobt and so we can run out of transaction block reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: convert list of extents to free into a regular listDarrick J. Wong
In struct xfs_bmap_free, convert the open-coded free extent list to a regular list, then use list_sort to sort it prior to processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: separate freelist fixing into a separate helperDave Chinner
Break up xfs_free_extent() into a helper that fixes the freelist. This helper will be used subsequently to ensure the freelist during deferred rmap processing. [darrick: refactor to put this at the head of the patchset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: rearrange xfs_bmap_add_free parametersDarrick J. Wong
This is already in xfsprogs' libxfs, so port it to the kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: check for a valid error_tag in errortag_addDarrick J. Wong
Currently we don't check the error_tag when someone's trying to set up error injection testing. If userspace passes in a value we don't know about, send back an error. This will help xfstests to _notrun a test that uses error injection to test things like log replay. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>