From 99579ccec4e271c3d4d4e7c946058766812afdab Mon Sep 17 00:00:00 2001
From: Brian Foster <bfoster@redhat.com>
Date: Fri, 22 Jul 2016 09:50:38 +1000
Subject: xfs: skip dirty pages in ->releasepage()

XFS has had scattered reports of delalloc blocks present at
->releasepage() time. This results in a warning with a stack trace
similar to the following:

 ...
 Call Trace:
  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
  [<ffffffffa2168539>] kswapd+0x4f9/0x970
  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100

This occurs because it is possible for shrink_active_list() to send
pages marked dirty to ->releasepage() when certain buffer_head threshold
conditions are met. shrink_active_list() doesn't check the page dirty
state apparently to handle an old ext3 corner case where in some cases
clean pages would not have the dirty bit cleared, thus it is up to the
filesystem to determine how to handle the page.

XFS currently handles the delalloc case properly, but this behavior
makes the warning spurious. Update the XFS ->releasepage() handler to
explicitly skip dirty pages. Retain the existing delalloc/unwritten
checks so we continue to warn if such buffers exist on clean pages when
they shouldn't.

Diagnosed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3ba0809..6135787 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1040,6 +1040,20 @@ xfs_vm_releasepage(
 
 	trace_xfs_releasepage(page->mapping->host, page, 0, 0);
 
+	/*
+	 * mm accommodates an old ext3 case where clean pages might not have had
+	 * the dirty bit cleared. Thus, it can send actual dirty pages to
+	 * ->releasepage() via shrink_active_list(). Conversely,
+	 * block_invalidatepage() can send pages that are still marked dirty
+	 * but otherwise have invalidated buffers.
+	 *
+	 * We've historically freed buffers on the latter. Instead, quietly
+	 * filter out all dirty pages to avoid spurious buffer state warnings.
+	 * This can likely be removed once shrink_active_list() is fixed.
+	 */
+	if (PageDirty(page))
+		return 0;
+
 	xfs_count_page_state(page, &delalloc, &unwritten);
 
 	if (WARN_ON_ONCE(delalloc))
-- 
cgit v0.10.2


From f021bd071f06b545926b1031348873b05442139f Mon Sep 17 00:00:00 2001
From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 22 Jul 2016 09:50:55 +1000
Subject: xfs: remove dax code from object file when disabled

We check IS_DAX(inode) before calling either xfs_file_dax_read or
xfs_file_dax_write, and this will lead the call being optimized out at
compile time when CONFIG_FS_DAX is disabled.

However, the two functions are marked STATIC, so they become global
symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global
functions that call into an undefined function and a broken "allmodconfig"
build:

fs/built-in.o: In function `xfs_file_dax_read':
fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io'
fs/built-in.o: In function `xfs_file_dax_write':
fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io'

Marking the two functions 'static noinline' instead of 'STATIC' will let
the compiler drop the symbols when there are no callers but avoid the
implicit inlining.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 16d4d43595b4 ("xfs: split direct I/O and DAX path")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d97e8cb..49fc9ac 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -370,7 +370,7 @@ xfs_file_dio_aio_read(
 	return ret;
 }
 
-STATIC ssize_t
+static noinline ssize_t
 xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
@@ -875,7 +875,7 @@ out:
 	return ret;
 }
 
-STATIC ssize_t
+static noinline ssize_t
 xfs_file_dax_write(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
-- 
cgit v0.10.2


From 160ae76fa1a2ee2345cb66eb343e24a34d2f051d Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Fri, 22 Jul 2016 09:51:05 +1000
Subject: libxfs: directory node splitting does not have an extra block

xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tre is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry.  This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

Indeed, for directories they *may* be an extra block on this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory.

The problem is that the code assumes that there may be extra blocks
that we need to link into the tree once we've split the root, but
this is not true for either dir or attr trees, because the extra
attr block is always consumed by the last node split before we split
the root. Hence the linking in an extra block is always wrong at the
root split level, and this manifests itself in repair as a directory
corruption in a repaired directory, leaving the directory rebuild
incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by ensuring the linking of the blocks after the root split
never tries to make use of the extra blocks that may be held in the
cursor. They are held there for other purposes and should never be
touched by the root splitting code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 097bf77..0f1f165 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -356,7 +356,6 @@ xfs_da3_split(
 	struct xfs_da_state_blk	*newblk;
 	struct xfs_da_state_blk	*addblk;
 	struct xfs_da_intnode	*node;
-	struct xfs_buf		*bp;
 	int			max;
 	int			action = 0;
 	int			error;
@@ -397,7 +396,9 @@ xfs_da3_split(
 				break;
 			}
 			/*
-			 * Entry wouldn't fit, split the leaf again.
+			 * Entry wouldn't fit, split the leaf again. The new
+			 * extrablk will be consumed by xfs_da3_node_split if
+			 * the node is split.
 			 */
 			state->extravalid = 1;
 			if (state->inleaf) {
@@ -446,6 +447,14 @@ xfs_da3_split(
 		return 0;
 
 	/*
+	 * xfs_da3_node_split() should have consumed any extra blocks we added
+	 * during a double leaf split in the attr fork. This is guaranteed as
+	 * we can't be here if the attr fork only has a single leaf block.
+	 */
+	ASSERT(state->extravalid == 0 ||
+	       state->path.blk[max].magic == XFS_DIR2_LEAFN_MAGIC);
+
+	/*
 	 * Split the root node.
 	 */
 	ASSERT(state->path.active == 0);
@@ -457,43 +466,33 @@ xfs_da3_split(
 	}
 
 	/*
-	 * Update pointers to the node which used to be block 0 and
-	 * just got bumped because of the addition of a new root node.
-	 * There might be three blocks involved if a double split occurred,
-	 * and the original block 0 could be at any position in the list.
+	 * Update pointers to the node which used to be block 0 and just got
+	 * bumped because of the addition of a new root node.  Note that the
+	 * original block 0 could be at any position in the list of blocks in
+	 * the tree.
 	 *
-	 * Note: the magic numbers and sibling pointers are in the same
-	 * physical place for both v2 and v3 headers (by design). Hence it
-	 * doesn't matter which version of the xfs_da_intnode structure we use
-	 * here as the result will be the same using either structure.
+	 * Note: the magic numbers and sibling pointers are in the same physical
+	 * place for both v2 and v3 headers (by design). Hence it doesn't matter
+	 * which version of the xfs_da_intnode structure we use here as the
+	 * result will be the same using either structure.
 	 */
 	node = oldblk->bp->b_addr;
 	if (node->hdr.info.forw) {
-		if (be32_to_cpu(node->hdr.info.forw) == addblk->blkno) {
-			bp = addblk->bp;
-		} else {
-			ASSERT(state->extravalid);
-			bp = state->extrablk.bp;
-		}
-		node = bp->b_addr;
+		ASSERT(be32_to_cpu(node->hdr.info.forw) == addblk->blkno);
+		node = addblk->bp->b_addr;
 		node->hdr.info.back = cpu_to_be32(oldblk->blkno);
-		xfs_trans_log_buf(state->args->trans, bp,
-		    XFS_DA_LOGRANGE(node, &node->hdr.info,
-		    sizeof(node->hdr.info)));
+		xfs_trans_log_buf(state->args->trans, addblk->bp,
+				  XFS_DA_LOGRANGE(node, &node->hdr.info,
+				  sizeof(node->hdr.info)));
 	}
 	node = oldblk->bp->b_addr;
 	if (node->hdr.info.back) {
-		if (be32_to_cpu(node->hdr.info.back) == addblk->blkno) {
-			bp = addblk->bp;
-		} else {
-			ASSERT(state->extravalid);
-			bp = state->extrablk.bp;
-		}
-		node = bp->b_addr;
+		ASSERT(be32_to_cpu(node->hdr.info.back) == addblk->blkno);
+		node = addblk->bp->b_addr;
 		node->hdr.info.forw = cpu_to_be32(oldblk->blkno);
-		xfs_trans_log_buf(state->args->trans, bp,
-		    XFS_DA_LOGRANGE(node, &node->hdr.info,
-		    sizeof(node->hdr.info)));
+		xfs_trans_log_buf(state->args->trans, addblk->bp,
+				  XFS_DA_LOGRANGE(node, &node->hdr.info,
+				  sizeof(node->hdr.info)));
 	}
 	addblk->bp = NULL;
 	return 0;
-- 
cgit v0.10.2


From b1c5ebb21301fcc47392ba3dfc7214f6c2b54032 Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Fri, 22 Jul 2016 09:52:35 +1000
Subject: xfs: allocate log vector buffers outside CIL context lock

One of the problems we currently have with delayed logging is that
under serious memory pressure we can deadlock memory reclaim. THis
occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
inodes and issues a log force to unpin inodes that are dirty in the
CIL.

The CIL is pushed, but this will only occur once it gets the CIL
context lock to ensure that all committing transactions are complete
and no new transactions start being committed to the CIL while the
push switches to a new context.

The deadlock occurs when the CIL context lock is held by a
committing process that is doing memory allocation for log vector
buffers, and that allocation is then blocked on memory reclaim
making progress. Memory reclaim, however, is blocked waiting for
a log force to make progress, and so we effectively deadlock at this
point.

To solve this problem, we have to move the CIL log vector buffer
allocation outside of the context lock so that memory reclaim can
always make progress when it needs to force the log. The problem
with doing this is that a CIL push can take place while we are
determining if we need to allocate a new log vector buffer for
an item and hence the current log vector may go away without
warning. That means we canot rely on the existing log vector being
present when we finally grab the context lock and so we must have a
replacement buffer ready to go at all times.

To ensure this, introduce a "shadow log vector" buffer that is
always guaranteed to be present when we gain the CIL context lock
and format the item. This shadow buffer may or may not be used
during the formatting, but if the log item does not have an existing
log vector buffer or that buffer is too small for the new
modifications, we swap it for the new shadow buffer and format
the modifications into that new log vector buffer.

The result of this is that for any object we modify more than once
in a given CIL checkpoint, we double the memory required
to track dirty regions in the log. For single modifications then
we consume the shadow log vectorwe allocate on commit, and that gets
consumed by the checkpoint. However, if we make multiple
modifications, then the second transaction commit will allocate a
shadow log vector and hence we will end up with double the memory
usage as only one of the log vectors is consumed by the CIL
checkpoint. The remaining shadow vector will be freed when th elog
item is freed.

This can probably be optimised in future - access to the shadow log
vector is serialised by the object lock (as opposited to the active
log vector, which is controlled by the CIL context lock) and so we
can probably free shadow log vector from some objects when the log
item is marked clean on removal from the AIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 3425799..4561b1e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -949,6 +949,7 @@ xfs_buf_item_free(
 	xfs_buf_log_item_t	*bip)
 {
 	xfs_buf_item_free_format(bip);
+	kmem_free(bip->bli_item.li_lv_shadow);
 	kmem_zone_free(xfs_buf_item_zone, bip);
 }
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index e064665..ccb0811 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -74,6 +74,7 @@ xfs_qm_dqdestroy(
 {
 	ASSERT(list_empty(&dqp->q_lru));
 
+	kmem_free(dqp->q_logitem.qli_item.li_lv_shadow);
 	mutex_destroy(&dqp->q_qlock);
 
 	XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot);
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 814cff9..2c7a162 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -370,6 +370,8 @@ xfs_qm_qoffend_logitem_committed(
 	spin_lock(&ailp->xa_lock);
 	xfs_trans_ail_delete(ailp, &qfs->qql_item, SHUTDOWN_LOG_IO_ERROR);
 
+	kmem_free(qfs->qql_item.li_lv_shadow);
+	kmem_free(lip->li_lv_shadow);
 	kmem_free(qfs);
 	kmem_free(qfe);
 	return (xfs_lsn_t)-1;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 4aa0153..ab77946 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -40,6 +40,7 @@ void
 xfs_efi_item_free(
 	struct xfs_efi_log_item	*efip)
 {
+	kmem_free(efip->efi_item.li_lv_shadow);
 	if (efip->efi_format.efi_nextents > XFS_EFI_MAX_FAST_EXTENTS)
 		kmem_free(efip);
 	else
@@ -300,6 +301,7 @@ static inline struct xfs_efd_log_item *EFD_ITEM(struct xfs_log_item *lip)
 STATIC void
 xfs_efd_item_free(struct xfs_efd_log_item *efdp)
 {
+	kmem_free(efdp->efd_item.li_lv_shadow);
 	if (efdp->efd_format.efd_nextents > XFS_EFD_MAX_FAST_EXTENTS)
 		kmem_free(efdp);
 	else
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index a1b0761..892c2ac 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -651,6 +651,7 @@ void
 xfs_inode_item_destroy(
 	xfs_inode_t	*ip)
 {
+	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
 	kmem_zone_free(xfs_ili_zone, ip->i_itemp);
 }
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 5e54e79..a4ab192 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -78,6 +78,157 @@ xlog_cil_init_post_recovery(
 	log->l_cilp->xc_ctx->sequence = 1;
 }
 
+static inline int
+xlog_cil_iovec_space(
+	uint	niovecs)
+{
+	return round_up((sizeof(struct xfs_log_vec) +
+					niovecs * sizeof(struct xfs_log_iovec)),
+			sizeof(uint64_t));
+}
+
+/*
+ * Allocate or pin log vector buffers for CIL insertion.
+ *
+ * The CIL currently uses disposable buffers for copying a snapshot of the
+ * modified items into the log during a push. The biggest problem with this is
+ * the requirement to allocate the disposable buffer during the commit if:
+ *	a) does not exist; or
+ *	b) it is too small
+ *
+ * If we do this allocation within xlog_cil_insert_format_items(), it is done
+ * under the xc_ctx_lock, which means that a CIL push cannot occur during
+ * the memory allocation. This means that we have a potential deadlock situation
+ * under low memory conditions when we have lots of dirty metadata pinned in
+ * the CIL and we need a CIL commit to occur to free memory.
+ *
+ * To avoid this, we need to move the memory allocation outside the
+ * xc_ctx_lock, but because the log vector buffers are disposable, that opens
+ * up a TOCTOU race condition w.r.t. the CIL committing and removing the log
+ * vector buffers between the check and the formatting of the item into the
+ * log vector buffer within the xc_ctx_lock.
+ *
+ * Because the log vector buffer needs to be unchanged during the CIL push
+ * process, we cannot share the buffer between the transaction commit (which
+ * modifies the buffer) and the CIL push context that is writing the changes
+ * into the log. This means skipping preallocation of buffer space is
+ * unreliable, but we most definitely do not want to be allocating and freeing
+ * buffers unnecessarily during commits when overwrites can be done safely.
+ *
+ * The simplest solution to this problem is to allocate a shadow buffer when a
+ * log item is committed for the second time, and then to only use this buffer
+ * if necessary. The buffer can remain attached to the log item until such time
+ * it is needed, and this is the buffer that is reallocated to match the size of
+ * the incoming modification. Then during the formatting of the item we can swap
+ * the active buffer with the new one if we can't reuse the existing buffer. We
+ * don't free the old buffer as it may be reused on the next modification if
+ * it's size is right, otherwise we'll free and reallocate it at that point.
+ *
+ * This function builds a vector for the changes in each log item in the
+ * transaction. It then works out the length of the buffer needed for each log
+ * item, allocates them and attaches the vector to the log item in preparation
+ * for the formatting step which occurs under the xc_ctx_lock.
+ *
+ * While this means the memory footprint goes up, it avoids the repeated
+ * alloc/free pattern that repeated modifications of an item would otherwise
+ * cause, and hence minimises the CPU overhead of such behaviour.
+ */
+static void
+xlog_cil_alloc_shadow_bufs(
+	struct xlog		*log,
+	struct xfs_trans	*tp)
+{
+	struct xfs_log_item_desc *lidp;
+
+	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
+		struct xfs_log_item *lip = lidp->lid_item;
+		struct xfs_log_vec *lv;
+		int	niovecs = 0;
+		int	nbytes = 0;
+		int	buf_size;
+		bool	ordered = false;
+
+		/* Skip items which aren't dirty in this transaction. */
+		if (!(lidp->lid_flags & XFS_LID_DIRTY))
+			continue;
+
+		/* get number of vecs and size of data to be stored */
+		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
+
+		/*
+		 * Ordered items need to be tracked but we do not wish to write
+		 * them. We need a logvec to track the object, but we do not
+		 * need an iovec or buffer to be allocated for copying data.
+		 */
+		if (niovecs == XFS_LOG_VEC_ORDERED) {
+			ordered = true;
+			niovecs = 0;
+			nbytes = 0;
+		}
+
+		/*
+		 * We 64-bit align the length of each iovec so that the start
+		 * of the next one is naturally aligned.  We'll need to
+		 * account for that slack space here. Then round nbytes up
+		 * to 64-bit alignment so that the initial buffer alignment is
+		 * easy to calculate and verify.
+		 */
+		nbytes += niovecs * sizeof(uint64_t);
+		nbytes = round_up(nbytes, sizeof(uint64_t));
+
+		/*
+		 * The data buffer needs to start 64-bit aligned, so round up
+		 * that space to ensure we can align it appropriately and not
+		 * overrun the buffer.
+		 */
+		buf_size = nbytes + xlog_cil_iovec_space(niovecs);
+
+		/*
+		 * if we have no shadow buffer, or it is too small, we need to
+		 * reallocate it.
+		 */
+		if (!lip->li_lv_shadow ||
+		    buf_size > lip->li_lv_shadow->lv_size) {
+
+			/*
+			 * We free and allocate here as a realloc would copy
+			 * unecessary data. We don't use kmem_zalloc() for the
+			 * same reason - we don't need to zero the data area in
+			 * the buffer, only the log vector header and the iovec
+			 * storage.
+			 */
+			kmem_free(lip->li_lv_shadow);
+
+			lv = kmem_alloc(buf_size, KM_SLEEP|KM_NOFS);
+			memset(lv, 0, xlog_cil_iovec_space(niovecs));
+
+			lv->lv_item = lip;
+			lv->lv_size = buf_size;
+			if (ordered)
+				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
+			else
+				lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
+			lip->li_lv_shadow = lv;
+		} else {
+			/* same or smaller, optimise common overwrite case */
+			lv = lip->li_lv_shadow;
+			if (ordered)
+				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
+			else
+				lv->lv_buf_len = 0;
+			lv->lv_bytes = 0;
+			lv->lv_next = NULL;
+		}
+
+		/* Ensure the lv is set up according to ->iop_size */
+		lv->lv_niovecs = niovecs;
+
+		/* The allocated data region lies beyond the iovec region */
+		lv->lv_buf = (char *)lv + xlog_cil_iovec_space(niovecs);
+	}
+
+}
+
 /*
  * Prepare the log item for insertion into the CIL. Calculate the difference in
  * log space and vectors it will consume, and if it is a new item pin it as
@@ -100,16 +251,19 @@ xfs_cil_prepare_item(
 	/*
 	 * If there is no old LV, this is the first time we've seen the item in
 	 * this CIL context and so we need to pin it. If we are replacing the
-	 * old_lv, then remove the space it accounts for and free it.
+	 * old_lv, then remove the space it accounts for and make it the shadow
+	 * buffer for later freeing. In both cases we are now switching to the
+	 * shadow buffer, so update the the pointer to it appropriately.
 	 */
-	if (!old_lv)
+	if (!old_lv) {
 		lv->lv_item->li_ops->iop_pin(lv->lv_item);
-	else if (old_lv != lv) {
+		lv->lv_item->li_lv_shadow = NULL;
+	} else if (old_lv != lv) {
 		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
 
 		*diff_len -= old_lv->lv_bytes;
 		*diff_iovecs -= old_lv->lv_niovecs;
-		kmem_free(old_lv);
+		lv->lv_item->li_lv_shadow = old_lv;
 	}
 
 	/* attach new log vector to log item */
@@ -133,11 +287,13 @@ xfs_cil_prepare_item(
  * write it out asynchronously without needing to relock the object that was
  * modified at the time it gets written into the iclog.
  *
- * This function builds a vector for the changes in each log item in the
- * transaction. It then works out the length of the buffer needed for each log
- * item, allocates them and formats the vector for the item into the buffer.
- * The buffer is then attached to the log item are then inserted into the
- * Committed Item List for tracking until the next checkpoint is written out.
+ * This function takes the prepared log vectors attached to each log item, and
+ * formats the changes into the log vector buffer. The buffer it uses is
+ * dependent on the current state of the vector in the CIL - the shadow lv is
+ * guaranteed to be large enough for the current modification, but we will only
+ * use that if we can't reuse the existing lv. If we can't reuse the existing
+ * lv, then simple swap it out for the shadow lv. We don't free it - that is
+ * done lazily either by th enext modification or the freeing of the log item.
  *
  * We don't set up region headers during this process; we simply copy the
  * regions into the flat buffer. We can do this because we still have to do a
@@ -170,59 +326,29 @@ xlog_cil_insert_format_items(
 	list_for_each_entry(lidp, &tp->t_items, lid_trans) {
 		struct xfs_log_item *lip = lidp->lid_item;
 		struct xfs_log_vec *lv;
-		struct xfs_log_vec *old_lv;
-		int	niovecs = 0;
-		int	nbytes = 0;
-		int	buf_size;
+		struct xfs_log_vec *old_lv = NULL;
+		struct xfs_log_vec *shadow;
 		bool	ordered = false;
 
 		/* Skip items which aren't dirty in this transaction. */
 		if (!(lidp->lid_flags & XFS_LID_DIRTY))
 			continue;
 
-		/* get number of vecs and size of data to be stored */
-		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
-
-		/* Skip items that do not have any vectors for writing */
-		if (!niovecs)
-			continue;
-
 		/*
-		 * Ordered items need to be tracked but we do not wish to write
-		 * them. We need a logvec to track the object, but we do not
-		 * need an iovec or buffer to be allocated for copying data.
+		 * The formatting size information is already attached to
+		 * the shadow lv on the log item.
 		 */
-		if (niovecs == XFS_LOG_VEC_ORDERED) {
+		shadow = lip->li_lv_shadow;
+		if (shadow->lv_buf_len == XFS_LOG_VEC_ORDERED)
 			ordered = true;
-			niovecs = 0;
-			nbytes = 0;
-		}
 
-		/*
-		 * We 64-bit align the length of each iovec so that the start
-		 * of the next one is naturally aligned.  We'll need to
-		 * account for that slack space here. Then round nbytes up
-		 * to 64-bit alignment so that the initial buffer alignment is
-		 * easy to calculate and verify.
-		 */
-		nbytes += niovecs * sizeof(uint64_t);
-		nbytes = round_up(nbytes, sizeof(uint64_t));
-
-		/* grab the old item if it exists for reservation accounting */
-		old_lv = lip->li_lv;
-
-		/*
-		 * The data buffer needs to start 64-bit aligned, so round up
-		 * that space to ensure we can align it appropriately and not
-		 * overrun the buffer.
-		 */
-		buf_size = nbytes +
-			   round_up((sizeof(struct xfs_log_vec) +
-				     niovecs * sizeof(struct xfs_log_iovec)),
-				    sizeof(uint64_t));
+		/* Skip items that do not have any vectors for writing */
+		if (!shadow->lv_niovecs && !ordered)
+			continue;
 
 		/* compare to existing item size */
-		if (lip->li_lv && buf_size <= lip->li_lv->lv_size) {
+		old_lv = lip->li_lv;
+		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
 			/* same or smaller, optimise common overwrite case */
 			lv = lip->li_lv;
 			lv->lv_next = NULL;
@@ -236,32 +362,29 @@ xlog_cil_insert_format_items(
 			 */
 			*diff_iovecs -= lv->lv_niovecs;
 			*diff_len -= lv->lv_bytes;
+
+			/* Ensure the lv is set up according to ->iop_size */
+			lv->lv_niovecs = shadow->lv_niovecs;
+
+			/* reset the lv buffer information for new formatting */
+			lv->lv_buf_len = 0;
+			lv->lv_bytes = 0;
+			lv->lv_buf = (char *)lv +
+					xlog_cil_iovec_space(lv->lv_niovecs);
 		} else {
-			/* allocate new data chunk */
-			lv = kmem_zalloc(buf_size, KM_SLEEP|KM_NOFS);
+			/* switch to shadow buffer! */
+			lv = shadow;
 			lv->lv_item = lip;
-			lv->lv_size = buf_size;
 			if (ordered) {
 				/* track as an ordered logvec */
 				ASSERT(lip->li_lv == NULL);
-				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
 				goto insert;
 			}
-			lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
 		}
 
-		/* Ensure the lv is set up according to ->iop_size */
-		lv->lv_niovecs = niovecs;
-
-		/* The allocated data region lies beyond the iovec region */
-		lv->lv_buf_len = 0;
-		lv->lv_bytes = 0;
-		lv->lv_buf = (char *)lv + buf_size - nbytes;
 		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
-
 		lip->li_ops->iop_format(lip, lv);
 insert:
-		ASSERT(lv->lv_buf_len <= nbytes);
 		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
 	}
 }
@@ -783,6 +906,13 @@ xfs_log_commit_cil(
 	struct xlog		*log = mp->m_log;
 	struct xfs_cil		*cil = log->l_cilp;
 
+	/*
+	 * Do all necessary memory allocation before we lock the CIL.
+	 * This ensures the allocation does not deadlock with a CIL
+	 * push in memory reclaim (e.g. from kswapd).
+	 */
+	xlog_cil_alloc_shadow_bufs(log, tp);
+
 	/* lock out background commit */
 	down_read(&cil->xc_ctx_lock);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 9a462e8..9b2b9fa 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -52,6 +52,7 @@ typedef struct xfs_log_item {
 	/* delayed logging */
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
+	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
 	xfs_lsn_t			li_seq;		/* CIL commit seq */
 } xfs_log_item_t;
 
-- 
cgit v0.10.2


From 28b783e47ad702b8e0f4861ef94cdfce6abd7c80 Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Fri, 22 Jul 2016 09:56:38 +1000
Subject: xfs: bufferhead chains are invalid after end_page_writeback

In xfs_finish_page_writeback(), we have a loop that looks like this:

        do {
                if (off < bvec->bv_offset)
                        goto next_bh;
                if (off > end)
                        break;
                bh->b_end_io(bh, !error);
next_bh:
                off += bh->b_size;
        } while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have marked as no longer
under IO.  This issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.....
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.....
 Call Trace:
  <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
  [<ffffffff8153a995>] print_trailer+0x115/0x1a0
  [<ffffffff81541174>] object_err+0x34/0x40
  [<ffffffff815436e7>] kasan_report_error+0x217/0x530
  [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
  [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
  [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
  [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
  [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
  [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
  [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
  [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
  [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340


Where the access is occuring during IO completion after the buffer
had been freed from direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the the bh->b_end_io() callout.

Yet another example of why Bufferheads Must Die.

cc: <stable@vger.kernel.org> # 4.7
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6135787..f1c7f8c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -87,6 +87,12 @@ xfs_find_bdev_for_inode(
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
  * we need to process on the page.
+ *
+ * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last
+ * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or
+ * the page at all, as we may be racing with memory reclaim and it can free both
+ * the bufferhead chain and the page as it will see the page as clean and
+ * unused.
  */
 static void
 xfs_finish_page_writeback(
@@ -95,8 +101,9 @@ xfs_finish_page_writeback(
 	int			error)
 {
 	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
-	struct buffer_head	*head, *bh;
+	struct buffer_head	*head, *bh, *next;
 	unsigned int		off = 0;
+	unsigned int		bsize;
 
 	ASSERT(bvec->bv_offset < PAGE_SIZE);
 	ASSERT((bvec->bv_offset & ((1 << inode->i_blkbits) - 1)) == 0);
@@ -105,15 +112,17 @@ xfs_finish_page_writeback(
 
 	bh = head = page_buffers(bvec->bv_page);
 
+	bsize = bh->b_size;
 	do {
+		next = bh->b_this_page;
 		if (off < bvec->bv_offset)
 			goto next_bh;
 		if (off > end)
 			break;
 		bh->b_end_io(bh, !error);
 next_bh:
-		off += bh->b_size;
-	} while ((bh = bh->b_this_page) != head);
+		off += bsize;
+	} while ((bh = next) != head);
 }
 
 /*
-- 
cgit v0.10.2


From 72ccbbe154fc307c98153725822be515fc0326d3 Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Fri, 22 Jul 2016 14:10:18 +1000
Subject: xfs: remove EXPERIMENTAL tag from sparse inode feature

Been around for long enough now, hasn't caused any regression test
failures in the past 3 months, so it's time to make it a fully
supported feature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 11ea5d5..1fa3f8f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1565,10 +1565,6 @@ xfs_fs_fill_super(
 		}
 	}
 
-	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
-		xfs_alert(mp,
-	"EXPERIMENTAL sparse inode feature enabled. Use at your own risk!");
-
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
-- 
cgit v0.10.2