summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-01-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This adds SEEK_HOLE and SEEK_DATA support in lseek" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add support for SEEK_HOLE and SEEK_DATA in lseek
2016-01-21ceph: use i_size_{read,write} to get/set i_sizeYan, Zheng
Cap message from MDS can update i_size. In that case, we don't hold i_mutex. So it's unsafe to directly access inode->i_size while holding i_mutex. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21ceph: re-send AIO write request when getting -EOLDSNAP errorYan, Zheng
When receiving -EOLDSNAP from OSD, we need to re-send corresponding write request. Due to locking issue, we can send new request inside another OSD request's complete callback. So we use worker to re-send request for AIO write. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21ceph: Asynchronous IO supportYan, Zheng
The basic idea of AIO support is simple, just call kiocb::ki_complete() in OSD request's complete callback. But there are several special cases. when IO span multiple objects, we need to wait until all OSD requests are complete, then call kiocb::ki_complete(). Error handling in this case is tricky too. For simplify, AIO both span multiple objects and extends i_size are not allowed. Another special case is check EOF for reading (other client can write to the file and extend i_size concurrently). For simplify, the direct-IO/AIO code path does do the check, fallback to normal syn read instead. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21ceph: Avoid to propagate the invalid page pointMinfei Huang
The variant pagep will still get the invalid page point, although ceph fails in function ceph_update_writeable_page. To fix this issue, Assigne the page to pagep until there is no failure in function ceph_update_writeable_page. Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21ceph: fix double page_unlock() in page_mkwrite()Yan, Zheng
ceph_update_writeable_page() unlocks the page on errors, so page_mkwrite() should not unlock the page again. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21btrfs: synchronize incompat feature bits with sysfs filesDavid Sterba
The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount (eg. the LZO compression). Synchronize the feature bits with the sysfs files representing them right after we set/clear them. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21btrfs: sysfs: introduce helper for syncing bits with sysfs filesDavid Sterba
The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount. We're going to sync them and need a helper to do that. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21btrfs: sysfs: add free-space-tree bit attributeDavid Sterba
The incompat bit representing the newly added free space tree feature is missing. Right now it will be listed only among features supported by the module, not per-fs. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "The highlights this round include: - Introduce configfs support for unlocked configfs_depend_item() (krzysztof + andrezej) - Conversion of usb-gadget target driver to new function registration interface (andrzej + sebastian) - Enable qla2xxx FC target mode support for Extended Logins (himansu + giridhar) - Enable qla2xxx FC target mode support for Exchange Offload (himansu + giridhar) - Add qla2xxx FC target mode irq affinity notification + selective command queuing. (quinn + himanshu) - Fix iscsi-target deadlock in se_node_acl configfs deletion (sagi + nab) - Convert se_node_acl configfs deletion + se_node_acl->queue_depth to proper se_session->sess_kref + target_get_session() usage. (hch + sagi + nab) - Fix long-standing race between se_node_acl->acl_kref get and get_initiator_node_acl() lookup. (hch + nab) - Fix target/user block-size handling, and make sure netlink reaches all network namespaces (sheng + andy) Note there is an outstanding bug-fix series for remote I_T nexus port TMR LUN_RESET has been posted and still being tested, and will likely become post -rc1 material at this point" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (56 commits) scsi: qla2xxxx: avoid type mismatch in comparison target/user: Make sure netlink would reach all network namespaces target: Obtain se_node_acl->acl_kref during get_initiator_node_acl target: Convert ACL change queue_depth se_session reference usage iscsi-target: Fix potential dead-lock during node acl delete ib_srpt: Convert acl lookup to modern get_initiator_node_acl usage tcm_fc: Convert acl lookup to modern get_initiator_node_acl usage tcm_fc: Wait for command completion before freeing a session target: Fix a memory leak in target_dev_lba_map_store() target: Support aborting tasks with a 64-bit tag usb/gadget: Remove set-but-not-used variables target: Remove an unused variable target: Fix indentation in target_core_configfs.c target/user: Allow user to set block size before enabling device iser-target: Fix non negative ERR_PTR isert_device_get usage target/fcoe: Add tag support to tcm_fc qla2xxx: Check for online flag instead of active reset when transmitting responses qla2xxx: Set all queues to 4k qla2xxx: Disable ZIO at start time. qla2xxx: Move atioq to a different lock to reduce lock contention ...
2016-01-21fs/adfs/adfs.h: tidy up commentsAndrew Morton
Lots of needless 80-col overflows. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fs/overlayfs/super.c needs pagemap.hAndrew Morton
i386 allmodconfig: In file included from fs/overlayfs/super.c:10:0: fs/overlayfs/super.c: In function 'ovl_fill_super': include/linux/fs.h:898:36: error: 'PAGE_CACHE_SIZE' undeclared (first use in this function) #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1) ^ fs/overlayfs/super.c:939:19: note: in expansion of macro 'MAX_LFS_FILESIZE' sb->s_maxbytes = MAX_LFS_FILESIZE; ^ include/linux/fs.h:898:36: note: each undeclared identifier is reported only once for each function it appears in #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1) ^ fs/overlayfs/super.c:939:19: note: in expansion of macro 'MAX_LFS_FILESIZE' sb->s_maxbytes = MAX_LFS_FILESIZE; ^ Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21proc read mm's {arg,env}_{start,end} with mmap semaphore taken.Mateusz Guzik
Only functions doing more than one read are modified. Consumeres happened to deal with possibly changing data, but it does not seem like a good thing to rely on. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anshuman Khandual <anshuman.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fs/coredump: prevent "" / "." / ".." core path componentsJann Horn
Let %h and %e print empty values as "!", "." as "!" and ".." as "!.". This prevents hostnames and comm values that are empty or consist of one or two dots from changing the directory level at which the corefile will be stored. Consider the case where someone decides to sort coredumps by hostname with a core pattern like "/cores/%h/core.%e.%p.%t" or so. In this case, hostnames "" and "." would cause the coredump to land directly in /cores, which is not what the intent behind the core pattern is, and ".." would cause the coredump to land in /. Yeah, there probably aren't many people who do that, but I still don't want this edgecase to be kind of broken. It seems very unlikely that this caused security issues anywhere, so I'm not requesting a stable backport. [akpm@linux-foundation.org: tweak code comment] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn
By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: constify fatent_operations structuresJulia Lawall
The fatent_operations structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: permit to return phy block number by fibmap in fallocated regionNamjae Jeon
Make the fibmap call return the proper physical block number for any offset request in the fallocated range. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: skip cluster allocation on fallocated regionNamjae Jeon
Skip new cluster allocation after checking i_blocks limit in _fat_get_block, because the blocks are already allocated in fallocated region. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: add fat_fallocate operationNamjae Jeon
Implement preallocation via the fallocate syscall on VFAT partitions. This patch is based on an earlier patch of the same name which had some issues detailed below and did not get accepted. Refer https://lkml.org/lkml/2007/12/22/130. a) The preallocated space was not persistent when the FALLOC_FL_KEEP_SIZE flag was set. It will deallocate cluster at evict time. b) There was no need to zero out the clusters when the flag was set Instead of doing an expanding truncate, just allocate clusters and add them to the fat chain. This reduces preallocation time. Compatibility with windows: There are no issues when FALLOC_FL_KEEP_SIZE is not set because it just does an expanding truncate. Thus reading from the preallocated area on windows returns null until data is written to it. When a file with preallocated area using the FALLOC_FL_KEEP_SIZE was written to on windows, the windows driver freed-up the preallocated clusters and allocated new clusters for the new data. The freed up clusters gets reflected in the free space available for the partition which can be seen from the Volume properties. The windows chkdsk tool also does not report any errors on a disk containing files with preallocated space. And there is also no issue using linux fat fsck. because discard preallocated clusters at repair time. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: add simple validation for directory inodeOGAWA Hirofumi
This detects simple corruption cases of directory, and tries to avoid further damage to user data. And performance impact of this validation should be very low, or not measurable. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fat: allow time_offset to be up to 24 hoursJan Kara
Currently we limit values of time_offset mount option to be between -12 and 12 hours. However e.g. zone GMT+12 can have a DST correction on top which makes the total time difference 13 hours. Update the checks in mount option parsing to allow offset of upto 24 hours to allow for unusual cases. Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Volker Kuhlmann <list0570@paradise.net.nz> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fs/hfs/catalog.c: use list_for_each_entry in hfs_cat_deleteGeliang Tang
Use list_for_each_entry() instead of list_for_each() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21epoll: add EPOLLEXCLUSIVE flagJason Baron
Currently, epoll file descriptors or epfds (the fd returned from epoll_create[1]()) that are added to a shared wakeup source are always added in a non-exclusive manner. This means that when we have multiple epfds attached to a shared fd source they are all woken up. This creates thundering herd type behavior. Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the 'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new flag allows for exclusive wakeups when there are multiple epfds attached to a shared fd event source. The implementation walks the list of exclusive waiters, and queues an event to each epfd, until it finds the first waiter that has threads blocked on it via epoll_wait(). The idea is to search for threads which are idle and ready to process the wakeup events. Thus, we queue an event to at least 1 epfd, but may still potentially queue an event to all epfds that are attached to the shared fd source. Performance testing was done by Madars Vitolins using a modified version of Enduro/X. The use of the 'EPOLLEXCLUSIVE' flag reduce the length of this particular workload from 860s down to 24s. Sample epoll_clt text: EPOLLEXCLUSIVE Sets an exclusive wakeup mode for the epfd file descriptor that is being attached to the target file descriptor, fd. Thus, when an event occurs and multiple epfd file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more epfds will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epfds to receive an event. EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Madars Vitolins <m@silodev.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21fs/proc/task_mmu.c: add workaround for old compilersKirill A. Shutemov
For THP=n, HPAGE_PMD_NR in smaps_account() expands to BUILD_BUG(). That's fine since this codepath is eliminated by modern compilers. But older compilers have not that efficient dead code elimination. It causes problem at least with gcc 4.1.2 on m68k: fs/built-in.o: In function `smaps_account': task_mmu.c:(.text+0x4f8fa): undefined reference to `__compiletime_assert_471' Let's replace HPAGE_PMD_NR with 1 << compound_order(page). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20btrfs: sysfs: fix typo in compat_ro attribute definitionDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-20Merge branch 'kbuild' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: - Make <modname>-m in makefiles work like <modname>-y and fix the fallout - Minor genksyms fix - Fix race with make -j install modules_install - Move -Wsign-compare from make W=1 to W=2 - Other minor fixes * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kbuild: Demote 'sign-compare' warning to W=2 Makefile: revert "Makefile: Document ability to make file.lst and file.S" partially kbuild: Do not run modules_install and install in paralel genksyms: Handle string literals with spaces in reference files fixdep: constify strrcmp arguments ath10k: Fix build with CONFIG_THERMAL=m Revert "drm: Hack around CONFIG_AGP=m build failures" kbuild: Allow to specify composite modules with modname-m staging/ad7606: Actually build the interface modules
2016-01-20btrfs: raid56: Use raid_write_end_io for scrubZhao Lei
No need to create additional end_io function for scrub, it increased code size and introduced some un-unified lines, as: raid_write_parity_end_io(): int err = bio->bi_error; if (bio->bi_error) raid_write_end_io(): int err = bio->bi_error; if (err) This patch combines them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Remove unnecessary ClearPageUptodate for raid56Zhao Lei
PageUptodate flag already initialized to 0 for new page, no need to set it again. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: use rbio->nr_pages to reduce calculationZhao Lei
We can use rbio->stripe_npages to reduce unnecessary calculation in many code place. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Use unified stripe_page's index calculationZhao Lei
We are using different index calculation method for stripe_page in current code: 1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index 2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index 3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index ... They can get same result when stripe_len align to PAGE_CACHE_SIZE, this is why current code can work, intruduce and use a common function for calculation is a better choose. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Fix calculation of rbio->dbitmap's size calculationZhao Lei
Current code is trying to calculate rbio->dbitmap's size to make it align to sizeof(long), but implement haven't achived this object, it is align to sizeof(char) instead. This patch fixed above calculation, and use sizeof(long) instead of fixed "8" to increate compatibility. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Fix no_space in write and rm loopZhao Lei
I see no_space in v4.4-rc1 again in xfstests generic/102. It happened randomly in some node only. (one of 4 phy-node, and a kvm with non-virtio block driver) By bisect, we can found the first-bad is: commit bdced438acd8 ("block: setup bi_phys_segments after splitting")' But above patch only triggered the bug by making bio operation faster(or slower). Main reason is in our space_allocating code, we need to commit page writeback before wait it complish, this patch fixed above bug. BTW, there is another reason for generic/102 fail, caused by disable default mixed-blockgroup, I'll fix it in xfstests. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: merge functions for wait snapshot creationZhao Lei
wait_for_snapshot_creation() is in same group with oher two: btrfs_start_write_no_snapshoting() btrfs_end_write_no_snapshoting() Rename wait_for_snapshot_creation() and move it into same place with other two. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: delete unused argument in btrfs_copy_from_userZhao Lei
size_t write_bytes is not necessary for btrfs_copy_from_user(), delete it. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Use direct way to determine raid56 write/recover modeZhao Lei
Old code used bbio->raid_map to determine whether in raid56 write/recover operation, because we didn't't have bbio->map_type. Now we have direct way for this condition, rid of using the function-relative data, and make the code more readable. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Small cleanup for get index_srcdev loopZhao Lei
1: Adjust condition in loop to make less TAB 2: Move btrfs_put_bbio()'s line for combine, and makes logic clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Enhance chunk validation checkQu Wenruo
Enhance chunk validation: 1) Num_stripes We already have such check but it's only in super block sys chunk array. Now check all on-disk chunks. 2) Chunk logical It should be aligned to sector size. This behavior should be *DOUBLE CHECKED* for 64K sector size like PPC64 or AArch64. Maybe we can found some hidden bugs. 3) Chunk length Same as chunk logical, should be aligned to sector size. 4) Stripe length It should be power of 2. 5) Chunk type Any bit out of TYPE_MAS | PROFILE_MASK is invalid. With all these much restrict rules, several fuzzed image reported in mail list should no longer cause kernel panic. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Enhance super validation checkQu Wenruo
Enhance btrfs_check_super_valid() function by the following points: 1) Restrict sector/node size check Not the old max/min valid check, but also check if it's a power of 2. So some bogus number like 12K node size won't pass now. 2) Super flag check For now, there is still some inconsistency between kernel and btrfs-progs super flags. And considering btrfs-progs may add new flags for super block, this check will only output warning. 3) Better root alignment check Now root bytenr is checked against sector size. 4) Move some check into btrfs_check_super_valid(). Like node size vs leaf size check, and PAGESIZE vs sectorsize check. And magic number check. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana
While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20Btrfs: fix typo in log message when starting a balanceFilipe Manana
The recent change titled "Btrfs: Check metadata redundancy on balance" (already in linux-next) left a typo in a message for users: metatdata -> metadata. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20Merge branch 'misc-for-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-20Merge branch 'misc-cleanups-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-20pipe: limit the per-user amount of pages allocated in pipesWilly Tarreau
On no-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limit are controlled by two new sysctls : pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-19ELF: Also pass any interpreter's file header to `arch_check_elf'Maciej W. Rozycki
Also pass any interpreter's file header to `arch_check_elf' so that any architecture handler can have a look at it if needed. Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Fortune <Matthew.Fortune@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/11478/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-01-19Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "We don't have a lot of core changes this time around, it's mostly in drivers, which will come in a subsequent pull. The cores changes include: - blk-mq - Prep patch from Christoph, changing blk_mq_alloc_request() to take flags instead of just using gfp_t for sleep/nosleep. - Doc patch from me, clarifying the difference between legacy and blk-mq for timer usage. - Fixes from Raghavendra for memory-less numa nodes, and a reuse of CPU masks. - Cleanup from Geliang Tang, using offset_in_page() instead of open coding it. - From Ilya, rename request_queue slab to it reflects what it holds, and a fix for proper use of bdgrab/put. - A real fix for the split across stripe boundaries from Keith. We yanked a broken version of this from 4.4-rc final, this one works. - From Mike Krinkin, emit a trace message when we split. - From Wei Tang, two small cleanups, not explicitly clearing memory that is already cleared" * 'for-4.5/core' of git://git.kernel.dk/linux-block: block: use bd{grab,put}() instead of open-coding block: split bios to max possible length block: add call to split trace point blk-mq: Avoid memoryless numa node encoded in hctx numa_node blk-mq: Reuse hardware context cpumask for tags blk-mq: add a flags parameter to blk_mq_alloc_request Revert "blk-flush: Queue through IO scheduler when flush not required" block: clarify blk_add_timer() use case for blk-mq bio: use offset_in_page macro block: do not initialise statics to 0 or NULL block: do not initialise globals to 0 or NULL block: rename request_queue slab cache
2016-01-19find_filesystem(): simplify comparisonAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-19btrfs: remove duplicate const specifierColin Ian King
duplicate const is redundant so remove it Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-18Merge branch 'xfs-misc-fixes-for-4.5-3' into for-nextDave Chinner
2016-01-18xfs: log mount failures don't wait for buffers to be releasedDave Chinner
Recently I've been seeing xfs/051 fail on 1k block size filesystems. Trying to trace the events during the test lead to the problem going away, indicating that it was a race condition that lead to this ASSERT failure: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156 ..... [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0 [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900 [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0 [<ffffffff811d8800>] mount_bdev+0x180/0x1b0 [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20 [<ffffffff811d90a8>] mount_fs+0x38/0x170 [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120 [<ffffffff811f7018>] do_mount+0x218/0xd60 [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0 When I finally caught it with tracing enabled, I saw that AG 2 had an elevated reference count and a buffer was responsible for it. I tracked down the specific buffer, and found that it was missing the final reference count release that would put it back on the LRU and hence be found by xfs_wait_buftarg() calls in the log mount failure handling. The last four traces for the buffer before the assert were (trimmed for relevance) kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5 mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC This is an async write that is completing, so there's nobody waiting for it directly. Hence we call xfs_buf_relse() once all the processing is complete. That does: static inline void xfs_buf_relse(xfs_buf_t *bp) { xfs_buf_unlock(bp); xfs_buf_rele(bp); } Now, it's clear that mount is waiting on the buffer lock, and that it has been released by xfs_buf_relse() and gained by mount. This is expected, because at this point the mount process is in xfs_buf_delwri_submit() waiting for all the IO it submitted to complete. The mount process, however, is waiting on the lock for the buffer because it is in xfs_buf_delwri_submit(). This waits for IO completion, but it doesn't wait for the buffer reference owned by the IO to go away. The mount process collects all the completions, fails the log recovery, and the higher level code then calls xfs_wait_buftarg() to free all the remaining buffers in the filesystem. The issue is that on unlocking the buffer, the scheduler has decided that the mount process has higher priority than the the kworker thread that is running the IO completion, and so immediately switched contexts to the mount process from the semaphore unlock code, hence preventing the kworker thread from finishing the IO completion and releasing the IO reference to the buffer. Hence by the time that xfs_wait_buftarg() is run, the buffer still has an active reference and so isn't on the LRU list that the function walks to free the remaining buffers. Hence we miss that buffer and continue onwards to tear down the mount structures, at which time we get find a stray reference count on the perag structure. On a non-debug kernel, this will be ignored and the structure torn down and freed. Hence when the kworker thread is then rescheduled and the buffer released and freed, it will access a freed perag structure. The problem here is that when the log mount fails, we still need to quiesce the log to ensure that the IO workqueues have returned to idle before we run xfs_wait_buftarg(). By synchronising the workqueues, we ensure that all IO completions are fully processed, not just to the point where buffers have been unlocked. This ensures we don't end up in the situation above. cc: <stable@vger.kernel.org> # 3.18 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-18Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"Dave Chinner
This reverts commit 24ba16bb3d499c49974669cd8429c3e4138ab102 as it prevents machines from suspending. This regression occurs when the xfsaild is idle on entry to suspend, and so there s no activity to wake it from it's idle sleep and hence see that it is supposed to freeze. Hence the freezer times out waiting for it and suspend is cancelled. There is no obvious fix for this short of freezing the filesystem properly, so revert this change for now. cc: <stable@vger.kernel.org> # 4.4 Signed-off-by: Dave Chinner <david@fromorbit.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>