summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-03-14Orangefs: merge to v4.5Mike Marshall
Merge tag 'v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into current Linux 4.5
2016-03-11Merge tag 'xfs-for-linus-4.5-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This is a fix for a regression introduced in 4.5-rc1 by the new torn log write detection code. The regression only affects people moving a clean filesystem between machines/kernels of different architecture (such as changing between 32 bit and 64 bit kernels), but this is the recommended (and only!) safe way to migrate a filesystem between architectures so we really need to ensure it works. The changes are larger than I'd prefer right at the end of the release cycle, but the majority of the change is just factoring code to enable the detection of a clean log at the correct time to avoid this issue. Changes: - Only perform torn log write detection on dirty logs. This prevents failures being detected due to a clean filesystem being moved between machines or kernels of different architectures (e.g. 32 -> 64 bit, BE -> LE, etc). This fixes a regression introduced by the torn log write detection in 4.5-rc1" * tag 'xfs-for-linus-4.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: only run torn log write detection on dirty logs xfs: refactor in-core log state update to helper xfs: refactor unmount record detection into helper xfs: separate log head record discovery from verification
2016-03-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes: Fix for my dumb braino in ncpfs and a long-standing breakage on recovery from failed rename() in jffs2" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: jffs2: reduce the breakage on recovery from halfway failed rename() ncpfs: fix a braino in OOM handling in ncp_fill_cache()
2016-03-10Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fix from Ted Ts'o: "This fixes a regression which crept in v4.5-rc5" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: iterate over buffer heads correctly in move_extent_per_page()
2016-03-10ext4: iterate over buffer heads correctly in move_extent_per_page()Eryu Guan
In commit bcff24887d00 ("ext4: don't read blocks from disk after extents being swapped") bh is not updated correctly in the for loop and wrong data has been written to disk. generic/324 catches this on sub-page block size ext4. Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09dax: check return value of dax_radix_entry()Ross Zwisler
dax_pfn_mkwrite() previously wasn't checking the return value of the call to dax_radix_entry(), which was a mistake. Instead, capture this return value and return the appropriate VM_FAULT_ value. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09ocfs2: fix return value from ocfs2_page_mkwrite()Jan Kara
ocfs2_page_mkwrite() could mistakenly return error code instead of mkwrite status value. Fix it. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09orangefs: make fs_mount_pending staticMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09orangefs: Avoid symlink upcall if target is too long.Martin Brandenburg
Previously the client-core detected this condition by sheer luck! Since we used strncpy, no NUL byte would be included on the name. The client-core would call strlen, which would read past the end of its buffer, but return a number large enough that the client-core would return ENAMETOOLONG. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09Orangefs: improve the POSIXness of interrupted writes...Mike Marshall
Don't return EINTR on interrupted writes if some data has already been written. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09Orangefs: add a new gossip statementMike Marshall
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-08jffs2: reduce the breakage on recovery from halfway failed rename()Al Viro
d_instantiate(new_dentry, old_inode) is absolutely wrong thing to do - it will oops if new_dentry used to be positive, for starters. What we need is d_invalidate() the target and be done with that. Cc: stable@vger.kernel.org # v3.18+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-08ncpfs: fix a braino in OOM handling in ncp_fill_cache()Al Viro
Failing to allocate an inode for child means that cache for *parent* is incompletely populated. So it's parent directory inode ('dir') that needs NCPI_DIR_CACHE flag removed, *not* the child inode ('inode', which is what we'd failed to allocate in the first place). Fucked-up-in: commit 5e993e25 ("ncpfs: get rid of d_validate() nonsense") Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org # v3.19 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-07Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Overlayfs bug fixes. All marked as -stable material" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: copy new uid/gid into overlayfs runtime inode ovl: ignore lower entries when checking purity of non-directory entries ovl: fix getcwd() failure after unsuccessful rmdir ovl: fix working on distributed fs as lower layer
2016-03-06xfs: only run torn log write detection on dirty logsBrian Foster
XFS uses CRC verification over a sub-range of the head of the log to detect and handle torn writes. This torn log write detection currently runs unconditionally at mount time, regardless of whether the log is dirty or clean. This is problematic in cases where a filesystem might end up being moved across different, incompatible (i.e., opposite byte-endianness) architectures. The problem lies in the fact that log data is not necessarily written in an architecture independent format. For example, certain bits of data are written in native endian format. Further, the size of certain log data structures differs (i.e., struct xlog_rec_header) depending on the word size of the cpu. This leads to false positive crc verification errors and ultimately failed mounts when a cleanly unmounted filesystem is mounted on a system with an incompatible architecture from data that was written near the head of the log. Update the log head/tail discovery code to run torn write detection only when the log is not clean. This means something other than an unmount record resides at the head of the log and log recovery is imminent. It is a requirement to run log recovery on the same type of host that had written the content of the dirty log and therefore CRC failures are legitimate corruptions in that scenario. Reported-by: Jan Beulich <JBeulich@suse.com> Tested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-06xfs: refactor in-core log state update to helperBrian Foster
Once the record at the head of the log is identified and verified, the in-core log state is updated based on the record. This includes information such as the current head block and cycle, the start block of the last record written to the log, the tail lsn, etc. Once torn write detection is conditional, this logic will need to be reused. Factor the code to update the in-core log data structures into a new helper function. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-06xfs: refactor unmount record detection into helperBrian Foster
Once the mount sequence has identified the head and tail blocks of the physical log, the record at the head of the log is located and examined for an unmount record to determine if the log is clean. This currently occurs after torn write verification of the head region of the log. This must ultimately be separated from torn write verification and may need to be called again if the log head is walked back due to a torn write (to determine whether the new head record is an unmount record). Separate this logic into a new helper function. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-06xfs: separate log head record discovery from verificationBrian Foster
The code that locates the log record at the head of the log is buried in the log head verification function. This is fine when torn write verification occurs unconditionally, but this behavior is problematic for filesystems that might be moved across systems with different architectures. In preparation for separating examination of the log head for unmount records from torn write detection, lift the record location logic out of the log verification function and into the caller. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fix from Sage Weil: "This is a final commit we missed to align the protocol compatibility with the feature bits. It decodes a few extra fields in two different messages and reports EIO when they are used (not yet supported)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
2016-03-05Merge branch 'for-linus2' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Round 2 of this. I cut back to the bare necessities, the patch is still larger than it usually would be at this time, due to the number of NVMe fixes in there. This pull request contains: - The 4 core fixes from Ming, that fix both problems with exceeding the virtual boundary limit in case of merging, and the gap checking for cloned bio's. - NVMe fixes from Keith and Christoph: - Regression on larger user commands, causing problems with reading log pages (for instance). This touches both NVMe, and the block core since that is now generally utilized also for these types of commands. - Hot removal fixes. - User exploitable issue with passthrough IO commands, if !length is given, causing us to fault on writing to the zero page. - Fix for a hang under error conditions - And finally, the current series regression for umount with cgroup writeback, where the final flush would happen async and hence open up window after umount where the device wasn't consistent. fsck right after umount would show this. From Tejun" * 'for-linus2' of git://git.kernel.dk/linux-block: block: support large requests in blk_rq_map_user_iov block: fix blk_rq_get_max_sectors for driver private requests nvme: fix max_segments integer truncation nvme: set queue limits for the admin queue writeback: flush inode cgroup wb switches instead of pinning super_block NVMe: Fix 0-length integrity payload NVMe: Don't allow unsupported flags NVMe: Move error handling to failed reset handler NVMe: Simplify device reset failure NVMe: Fix namespace removal deadlock NVMe: Use IDA for namespace disk naming NVMe: Don't unmap controller registers on reset block: merge: get the 1st and last bvec via helpers block: get the 1st and last bvec via helpers block: check virt boundary in bio_will_gap() block: bio: introduce helpers to get the 1st and last bvec
2016-03-05Merge tag 'for-linus-20160304' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull jffs2 fixes from David Woodhouse: "This contains two important JFFS2 fixes marked for stable: - a lock ordering problem between the page lock and the internal f->sem mutex, which was causing occasional deadlocks in garbage collection - a scan failure causing moved directories to sometimes end up appearing to have hard links. There are also a couple of trivial MAINTAINERS file updates" * tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd: MAINTAINERS: add maintainer entry for FREESCALE GPMI NAND driver Fix directory hardlinks from deleted directories jffs2: Fix page lock / f->sem deadlock Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin" MAINTAINERS: update Han's email
2016-03-05Merge branch 'for-linus-4.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "Filipe nailed down a problem where tree log replay would do some work that orphan code wasn't expecting to be done yet, leading to BUG_ON" * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix loading of orphan roots leading to BUG_ON
2016-03-04ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 supportYan, Zheng
Add support for the format change of MClientReply/MclientCaps. Also add code that denies access to inodes with pool_ns layouts. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-03Btrfs: fix loading of orphan roots leading to BUG_ONFilipe Manana
When looking for orphan roots during mount we can end up hitting a BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is replayed and qgroups are enabled. This is because after a log tree is replayed, a transaction commit is made, which triggers qgroup extent accounting which in turn does backref walking which ends up reading and inserting all roots in the radix tree fs_info->fs_root_radix, including orphan roots (deleted snapshots). So after the log tree is replayed, when finding orphan roots we hit the BUG_ON with the following trace: [118209.182438] ------------[ cut here ]------------ [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314! [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1 [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000 [118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246 [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001 [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000 [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000 [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000 [118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0 [118209.186318] Stack: [118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000 [118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000 [118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000 [118209.186318] Call Trace: [118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs] [118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3 [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8 [118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f [118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00 [118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP <ffff8800af34faa8> [118209.230735] ---[ end trace 83938f987d85d477 ]--- So fix this by not treating the error -EEXIST, returned when attempting to insert a root already inserted by the backref walking code, as an error. The following test case for xfstests reproduces the bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey _run_btrfs_util_prog quota enable $SCRATCH_MNT # Create 2 directories with one file in one of them. # We use these just to trigger a transaction commit later, moving the file from # directory a to directory b and doing an fsync against directory a. mkdir $SCRATCH_MNT/a mkdir $SCRATCH_MNT/b touch $SCRATCH_MNT/a/f sync # Create our test file with 2 4K extents. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Create a snapshot and delete it. This doesn't really delete the snapshot # immediately, just makes it inaccessible and invisible to user space, the # snapshot is deleted later by a dedicated kernel thread (cleaner kthread) # which is woke up at the next transaction commit. # A root orphan item is inserted into the tree of tree roots, so that if a # power failure happens before the dedicated kernel thread does the snapshot # deletion, the next time the filesystem is mounted it resumes the snapshot # deletion. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap # Now overwrite half of the extents we wrote before. Because we made a snapshpot # before, which isn't really deleted yet (since no transaction commit happened # after we did the snapshot delete request), the non overwritten extents get # referenced twice, once by the default subvolume and once by the snapshot. $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Now move file f from directory a to directory b and fsync directory a. # The fsync on the directory a triggers a transaction commit (because a file # was moved from it to another directory) and the file fsync leaves a log tree # with file extent items to replay. mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar | _filter_scratch # Now simulate a power failure and mount the filesystem to replay the log tree. # After the log tree was replayed, we used to hit a BUG_ON() when processing # the root orphan item for the deleted snapshot. This is because when processing # an orphan root the code expected to be the first code inserting the root into # the fs_info->fs_root_radix radix tree, while in reallity it was the second # caller attempting to do it - the first caller was the transaction commit that # took place after replaying the log tree, when updating the qgroup counters. _flakey_drop_and_remount echo "File digest before after failure:" # Must match what he got before the power failure. md5sum $SCRATCH_MNT/foobar | _filter_scratch _unmount_flakey status=0 exit Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-03writeback: flush inode cgroup wb switches instead of pinning super_blockTejun Heo
If cgroup writeback is in use, inodes can be scheduled for asynchronous wb switching. Before 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches"), this could race with umount leading to super_block being destroyed while inodes are pinned for wb switching. 5ff8eaac1636 fixed it by bumping s_active while wb switches are in flight; however, this allowed in-flight wb switches to make umounts asynchronous when the userland expected synchronosity - e.g. fsck immediately following umount may fail because the device is still busy. This patch removes the problematic super_block pinning and instead makes generic_shutdown_super() flush in-flight wb switches. wb switches are now executed on a dedicated isw_wq so that they can be flushed and isw_nr_in_flight keeps track of the number of in-flight wb switches so that flushing can be avoided in most cases. v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE check in inode_switch_wbs() as Jan an Al suggested. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Tahsin Erdogan <tahsin@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches") Cc: stable@vger.kernel.org #v4.5 Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03Orangefs: improve gossip statementsMike Marshall
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-03ovl: copy new uid/gid into overlayfs runtime inodeKonstantin Khlebnikov
Overlayfs must update uid/gid after chown, otherwise functions like inode_owner_or_capable() will check user against stale uid. Catched by xfstests generic/087, it chowns file and calls utimes. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org>
2016-03-03ovl: ignore lower entries when checking purity of non-directory entriesKonstantin Khlebnikov
After rename file dentry still holds reference to lower dentry from previous location. This doesn't matter for data access because data comes from upper dentry. But this stale lower dentry taints dentry at new location and turns it into non-pure upper. Such file leaves visible whiteout entry after remove in directory which shouldn't have whiteouts at all. Overlayfs already tracks pureness of file location in oe->opaque. This patch just uses that for detecting actual path type. Comment from Vivek Goyal's patch: Here are the details of the problem. Do following. $ mkdir upper lower work merged upper/dir/ $ touch lower/test $ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir= work merged $ mv merged/test merged/dir/ $ rm merged/dir/test $ ls -l merged/dir/ /usr/bin/ls: cannot access merged/dir/test: No such file or directory total 0 c????????? ? ? ? ? ? test Basic problem seems to be that once a file has been unlinked, a whiteout has been left behind which was not needed and hence it becomes visible. Whiteout is visible because parent dir is of not type MERGE, hence od->is_real is set during ovl_dir_open(). And that means ovl_iterate() passes on iterate handling directly to underlying fs. Underlying fs does not know/filter whiteouts so it becomes visible to user. Why did we leave a whiteout to begin with when we should not have. ovl_do_remove() checks for OVL_TYPE_PURE_UPPER() and does not leave whiteout if file is pure upper. In this case file is not found to be pure upper hence whiteout is left. So why file was not PURE_UPPER in this case? I think because dentry is still carrying some leftover state which was valid before rename. For example, od->numlower was set to 1 as it was a lower file. After rename, this state is not valid anymore as there is no such file in lower. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Viktor Stanchev <me@viktorstanchev.com> Suggested-by: Vivek Goyal <vgoyal@redhat.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=109611 Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org>
2016-03-03ovl: fix getcwd() failure after unsuccessful rmdirRui Wang
ovl_remove_upper() should do d_drop() only after it successfully removes the dir, otherwise a subsequent getcwd() system call will fail, breaking userspace programs. This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491 Signed-off-by: Rui Wang <rui.y.wang@intel.com> Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org>
2016-03-03ovl: fix working on distributed fs as lower layerKonstantin Khlebnikov
This adds missing .d_select_inode into alternative dentry_operations. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Fixes: 7c03b5d45b8e ("ovl: allow distributed fs as lower layer") Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Reviewed-by: Nikolay Borisov <kernel@kyup.com> Tested-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> # 4.2+
2016-03-02Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Various small CIFS/SMB3 fixes for stable: Fixes address oops that can occur when accessing Macs with SMB3, and another problem found to Samba when read responses queued (e.g. with gluster under Samba)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Fix duplicate line introduced by clone_file_range patch Fix cifs_uniqueid_to_ino_t() function for s390x CIFS: Fix SMB2+ interim response processing for read requests cifs: fix out-of-bounds access in lease parsing
2016-03-02userfaultfd: don't block on the last VM updates at exit timeLinus Torvalds
The exit path will do some final updates to the VM of an exiting process to inform others of the fact that the process is going away. That happens, for example, for robust futex state cleanup, but also if the parent has asked for a TID update when the process exits (we clear the child tid field in user space). However, at the time we do those final VM accesses, we've already stopped accepting signals, so the usual "stop waiting for userfaults on signal" code in fs/userfaultfd.c no longer works, and the process can become an unkillable zombie waiting for something that will never happen. To solve this, just make handle_userfault() abort any user fault handling if we're already in the exit path past the signal handling state being dead (marked by PF_EXITING). This VM special case is pretty ugly, and it is possible that we should look at finalizing signals later (or move the VM final accesses earlier). But in the meantime this is a fairly minimally intrusive fix. Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull d_inode/d_flags race fix from Al Viro. I love this fix. Not only does it fix the race in the dentry type handling, it entirely gets rid of the nasty and subtle memory ordering rules for d_type and d_inode, and replaces them with the basic dentry locking rules (sequence numbers under RCU, d_lock elsewhere). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use ->d_seq to get coherency between ->d_inode and ->d_flags
2016-03-01CIFS: Fix duplicate line introduced by clone_file_range patchSteve French
Commit 04b38d601239b4 ("vfs: pull btrfs clone API to vfs layer") added a duplicated line (in cifsfs.c) which causes a sparse compile warning. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-02-29use ->d_seq to get coherency between ->d_inode and ->d_flagsAl Viro
Games with ordering and barriers are way too brittle. Just bump ->d_seq before and after updating ->d_inode and ->d_flags type bits, so that verifying ->d_seq would guarantee they are coherent. Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-29Fix cifs_uniqueid_to_ino_t() function for s390xYadan Fan
This issue is caused by commit 02323db17e3a7 ("cifs: fix cifs_uniqueid_to_ino_t not to ever return 0"), when BITS_PER_LONG is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t() function will cast 64-bit fileid to 32-bit by using (ino_t)fileid, because ino_t (typdefed __kernel_ino_t) is int type. It's defined in arch/s390/include/uapi/asm/posix_types.h #ifndef __s390x__ typedef unsigned long __kernel_ino_t; ... #else /* __s390x__ */ typedef unsigned int __kernel_ino_t; So the #ifdef condition is wrong for s390x, we can just still use one cifs_uniqueid_to_ino_t() function with comparing sizeof(ino_t) and sizeof(u64) to choose the correct execution accordingly. Signed-off-by: Yadan Fan <ydfan@suse.com> CC: stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29CIFS: Fix SMB2+ interim response processing for read requestsPavel Shilovsky
For interim responses we only need to parse a header and update a number credits. Now it is done for all SMB2+ command except SMB2_READ which is wrong. Fix this by adding such processing. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29cifs: fix out-of-bounds access in lease parsingJustin Maggard
When opening a file, SMB2_open() attempts to parse the lease state from the SMB2 CREATE Response. However, the parsing code was not careful to ensure that the create contexts are not empty or invalid, which can lead to out- of-bounds memory access. This can be seen easily by trying to read a file from a OSX 10.11 SMB3 server. Here is sample crash output: BUG: unable to handle kernel paging request at ffff8800a1a77cc6 IP: [<ffffffff8828a734>] SMB2_open+0x804/0x960 PGD 8f77067 PUD 0 Oops: 0000 [#1] SMP Modules linked in: CPU: 3 PID: 2876 Comm: cp Not tainted 4.5.0-rc3.x86_64.1+ #14 Hardware name: NETGEAR ReadyNAS 314 /ReadyNAS 314 , BIOS 4.6.5 10/11/2012 task: ffff880073cdc080 ti: ffff88005b31c000 task.ti: ffff88005b31c000 RIP: 0010:[<ffffffff8828a734>] [<ffffffff8828a734>] SMB2_open+0x804/0x960 RSP: 0018:ffff88005b31fa08 EFLAGS: 00010282 RAX: 0000000000000015 RBX: 0000000000000000 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff88007eb8c8b0 RBP: ffff88005b31fad8 R08: 666666203d206363 R09: 6131613030383866 R10: 3030383866666666 R11: 00000000000002b0 R12: ffff8800660fd800 R13: ffff8800a1a77cc2 R14: 00000000424d53fe R15: ffff88005f5a28c0 FS: 00007f7c8a2897c0(0000) GS:ffff88007eb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8800a1a77cc6 CR3: 000000005b281000 CR4: 00000000000006e0 Stack: ffff88005b31fa70 ffffffff88278789 00000000000001d3 ffff88005f5a2a80 ffffffff00000003 ffff88005d029d00 ffff88006fde05a0 0000000000000000 ffff88005b31fc78 ffff88006fde0780 ffff88005b31fb2f 0000000100000fe0 Call Trace: [<ffffffff88278789>] ? cifsConvertToUTF16+0x159/0x2d0 [<ffffffff8828cf68>] smb2_open_file+0x98/0x210 [<ffffffff8811e80c>] ? __kmalloc+0x1c/0xe0 [<ffffffff882685f4>] cifs_open+0x2a4/0x720 [<ffffffff88122cef>] do_dentry_open+0x1ff/0x310 [<ffffffff88268350>] ? cifsFileInfo_get+0x30/0x30 [<ffffffff88123d92>] vfs_open+0x52/0x60 [<ffffffff88131dd0>] path_openat+0x170/0xf70 [<ffffffff88097d48>] ? remove_wait_queue+0x48/0x50 [<ffffffff88133a29>] do_filp_open+0x79/0xd0 [<ffffffff8813f2ca>] ? __alloc_fd+0x3a/0x170 [<ffffffff881240c4>] do_sys_open+0x114/0x1e0 [<ffffffff881241a9>] SyS_open+0x19/0x20 [<ffffffff8896e257>] entry_SYSCALL_64_fastpath+0x12/0x6a Code: 4d 8d 6c 07 04 31 c0 4c 89 ee e8 47 6f e5 ff 31 c9 41 89 ce 44 89 f1 48 c7 c7 28 b1 bd 88 31 c0 49 01 cd 4c 89 ee e8 2b 6f e5 ff <45> 0f b7 75 04 48 c7 c7 31 b1 bd 88 31 c0 4d 01 ee 4c 89 f6 e8 RIP [<ffffffff8828a734>] SMB2_open+0x804/0x960 RSP <ffff88005b31fa08> CR2: ffff8800a1a77cc6 ---[ end trace d9f69ba64feee469 ]--- Signed-off-by: Justin Maggard <jmaggard@netgear.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>
2016-02-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_last(): ELOOP failure exit should be done after leaving RCU mode should_follow_link(): validate ->d_seq after having decided to follow namei: ->d_inode of a pinned dentry is stable only for positives do_last(): don't let a bogus return value from ->open() et.al. to confuse us fs: return -EOPNOTSUPP if clone is not supported hpfs: don't truncate the file when delete fails
2016-02-28do_last(): ELOOP failure exit should be done after leaving RCU modeAl Viro
... or we risk seeing a bogus value of d_is_symlink() there. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-28should_follow_link(): validate ->d_seq after having decided to followAl Viro
... otherwise d_is_symlink() above might have nothing to do with the inode value we've got. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-28namei: ->d_inode of a pinned dentry is stable only for positivesAl Viro
both do_last() and walk_component() risk picking a NULL inode out of dentry about to become positive, *then* checking its flags and seeing that it's not negative anymore and using (already stale by then) value they'd fetched earlier. Usually ends up oopsing soon after that... Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-28do_last(): don't let a bogus return value from ->open() et.al. to confuse usAl Viro
... into returning a positive to path_openat(), which would interpret that as "symlink had been encountered" and proceed to corrupt memory, etc. It can only happen due to a bug in some ->open() instance or in some LSM hook, etc., so we report any such event *and* make sure it doesn't trick us into further unpleasantness. Cc: stable@vger.kernel.org # v3.6+, at least Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-28fs: return -EOPNOTSUPP if clone is not supportedChristoph Hellwig
-EBADF is a rather confusing error if an operations is not supported, and nfsd gets rather upset about it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-28hpfs: don't truncate the file when delete failsMikulas Patocka
The delete opration can allocate additional space on the HPFS filesystem due to btree split. The HPFS driver checks in advance if there is available space, so that it won't corrupt the btree if we run out of space during splitting. If there is not enough available space, the HPFS driver attempted to truncate the file, but this results in a deadlock since the commit 7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex and lock it on every callback from VFS"). This patch removes the code that tries to truncate the file and -ENOSPC is returned instead. If the user hits -ENOSPC on delete, he should try to delete other files (that are stored in a leaf btree node), so that the delete operation will make some space for deleting the file stored in non-leaf btree node. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@vger.kernel.org # 2.6.39+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: dax: move writeback calls into the filesystems dax: give DAX clearing code correct bdev ext4: online defrag not supported with DAX ext2, ext4: only set S_DAX for regular inodes block: disable block device DAX by default ocfs2: unlock inode if deleting inode from orphan fails mm: ASLR: use get_random_long() drivers: char: random: add get_random_long() mm: numa: quickly fail allocations for NUMA balancing on full nodes mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
2016-02-27Merge tag 'tags/ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext2/4 DAX fix from Ted Ts'o: "This fixes a file system corruption bug with DAX" * tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
2016-02-27ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()Ross Zwisler
As it is currently written ext4_dax_mkwrite() assumes that the call into __dax_mkwrite() will not have to do a block allocation so it doesn't create a journal entry. For a read that creates a zero page to cover a hole followed by a write that actually allocates storage this is incorrect. The ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls get_blocks() to allocate storage. Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault() as this function already has all the logic needed to allocate a journal entry and call __dax_fault(). Also update the ext2 fault handlers in this same way to remove duplicate code and keep the logic between ext2 and ext4 the same. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27dax: move writeback calls into the filesystemsRoss Zwisler
Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27dax: give DAX clearing code correct bdevRoss Zwisler
dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>