summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-10-09f2fs: do not skip dentry block writesJaegeuk Kim
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory pressure and the number of dirty pages is pretty small. But, we didn't skip for normal data writes, which gives us not much big impact on overall performance. Moreover, by skipping some data writes, kworker falls into infinite loop to try to write blocks, when many dir inodes have only one dentry block. So, this patch removes skipping data writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: remove unneeded f2fs_{,un}lock_op in do_recover_data()Chao Yu
Protecting recovery flow by using cp_rwsem is not needed, since we have prevent triggering any checkpoint by locking cp_mutex previously. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix incorrect bimodal calculationChao Yu
In update_sit_info, we use div_u64 to handle 'u64 divide u64' case, but div_u64 can only handle 32-bits divisor, so our divisor with u64 type passed to div_u64 will overflow, result in the wrong calculation when show debug info of f2fs as below: BDF: 464, avg. vblocks: 23509 (BDF should never exceed 100) So change to use div64_u64 to handle this case correctly. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: introduce __try_update_largest_extentChao Yu
This patch adds a new helper __try_update_largest_extent for cleanup. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix error handling for calls to various functions in the function ↵Nicholas Krause
recover_inline_data This fixes error handling for calls to various functions in the function recover_inline_data to check if these particular functions either return a error code or the boolean value false to signal their caller they have failed internally and if this arises return false to signal failure immediately to the caller of recover_inline_data as we cannot continue after failures to calling either the function truncate_inline_inode or truncate_blocks. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: disallow switch extent_cache option dynamicallyChao Yu
Swith extent_cache option dynamically when remount may casue consistency issue between extent cache and dnode page. Fix in this patch to avoid that condition. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: use correct flag in f2fs_map_blocks()Chao Yu
We introduce F2FS_GET_BLOCK_READ in commit e2b4e2bc8865 ("f2fs: fix incorrect mapping for bmap"), but forget to use this flag in the right place, fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix to handle io error in ->direct_IOChao Yu
Here is a oops reported as following message when testing generic/019 of xfstest: ------------[ cut here ]------------ kernel BUG at /home/yuchao/git/f2fs-dev/segment.c:882! invalid opcode: 0000 [#1] SMP Modules linked in: zram lz4_compress lz4_decompress f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_def CPU: 2 PID: 25441 Comm: fio Tainted: G O 4.3.0-rc1+ #6 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013 task: ffff8803f4e85580 ti: ffff8803fd61c000 task.ti: ffff8803fd61c000 RIP: 0010:[<ffffffffa0784981>] [<ffffffffa0784981>] new_curseg+0x321/0x330 [f2fs] RSP: 0018:ffff8803fd61f918 EFLAGS: 00010246 RAX: 00000000000007ed RBX: 0000000000000224 RCX: 000000000000001f RDX: 0000000000000800 RSI: ffffffffffffffff RDI: ffff8803f56f4300 RBP: ffff8803fd61f978 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000024 R11: ffff8800d23bbd78 R12: ffff8800d0ef0000 R13: 0000000000000224 R14: 0000000000000000 R15: 0000000000000001 FS: 00007f827ff85700(0000) GS:ffff88041ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600000 CR3: 00000003fef17000 CR4: 00000000001406e0 Stack: 000007ea00000002 0000000100000001 ffff8803f6456248 000007ed0000002b 0000000000000224 ffff880404d1aa20 ffff8803fd61f9c8 ffff8800d0ef0000 ffff8803f6456248 0000000000000001 00000000ffffffff ffffffffa078f358 Call Trace: [<ffffffffa0785b87>] allocate_segment_by_default+0x1a7/0x1f0 [f2fs] [<ffffffffa078322c>] allocate_data_block+0x17c/0x360 [f2fs] [<ffffffffa0779521>] __allocate_data_block+0x131/0x1d0 [f2fs] [<ffffffffa077a995>] f2fs_direct_IO+0x4b5/0x580 [f2fs] [<ffffffff811510ae>] generic_file_direct_write+0xae/0x160 [<ffffffff811518f5>] __generic_file_write_iter+0xd5/0x1f0 [<ffffffff81151e07>] generic_file_write_iter+0xf7/0x200 [<ffffffff81319e38>] ? apparmor_file_permission+0x18/0x20 [<ffffffffa0768480>] ? f2fs_fallocate+0x1190/0x1190 [f2fs] [<ffffffffa07684c6>] f2fs_file_write_iter+0x46/0x90 [f2fs] [<ffffffff8120b4fe>] aio_run_iocb+0x1ee/0x290 [<ffffffff81700f7e>] ? mutex_lock+0x1e/0x50 [<ffffffff8120a1d7>] ? aio_read_events+0x207/0x2b0 [<ffffffff8120b913>] do_io_submit+0x373/0x630 [<ffffffff8120a4f6>] ? SyS_io_getevents+0x56/0xb0 [<ffffffff8120bbe0>] SyS_io_submit+0x10/0x20 [<ffffffff81703857>] entry_SYSCALL_64_fastpath+0x12/0x6a Code: 45 c8 48 8b 78 10 e8 9f 23 bf e0 41 8b 8c 24 cc 03 00 00 89 c7 31 d2 89 c6 89 d8 29 df f7 f1 29 d1 39 cf 0f 83 be fd ff ff eb RIP [<ffffffffa0784981>] new_curseg+0x321/0x330 [f2fs] RSP <ffff8803fd61f918> ---[ end trace 2e577d7f711ddb86 ]--- The reason is that: in the test of generic/019, we will trigger a manmade IO error in block layer through debugfs, after that, prefree segment will no longer be freed, because we always skip doing gc or checkpoint when there occurs an IO error. Meanwhile fio with aio engine generated a large number of direct IOs, which continue allocating spaces in free segment until we run out of them, eventually, results in panic in new_curseg as no more free segment was found. So, this patch changes to return EIO in direct_IO for this condition. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: do in batches truncation in truncate_holeChao Yu
truncate_data_blocks_range can do in batches truncation which makes all changes in dnode page content, dnode page status, extent cache, block count updating together. But previously, truncate_hole() always truncates one block in dnode page at a time by invoking truncate_data_blocks_range(,1), which make thing slow. This patch changes truncate_hole() to do in batches truncation for all target blocks in one direct node inside truncate_data_blocks_range, which can make our punch hole operation in ->fallocate more efficent. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: optimize code of f2fs_update_extent_tree_rangeFan Li
Fix 2 potential problems: 1. when largest extent needs to be invalidated, it will be reset in __drop_largest_extent, which makes __is_extent_same after always return false, and largest extent unchanged. Now we update it properly. 2. when extent is split and the latter part remains in tree, next_en should be the latter part instead of next extent of original extent. It will cause merge failure if there is in-place update, although there is not, I think this fix will still makes codes less ambiguous. This patch also simplifies codes of invalidating extents, and optimizes the procedues that split extent into two. There are a few modifications after last patch: 1. prev_en now is updated properly. 2. more codes and branches are simplified. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: drop largest extent by rangeFan Li
now we update extent by range, fofs may not be on the largest extent if the new extent overlaps with it. so add a new function to drop largest extent properly. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: check end_io for metapages before making next checkpoint blocksJaegeuk Kim
This patch avoids to produce new checkpoint blocks before the previous meta pages were written completely. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs crypto: allocate buffer for decrypting filenameJaegeuk Kim
We got dentry pages from high_mem, and its address space directly goes into the decryption path via f2fs_fname_disk_to_usr. But, sg_init_one assumes the address is not from high_mem, so we can get this panic since it doesn't call kmap_high but kunmap_high is triggered at the end. kernel BUG at ../../../../../../kernel/mm/highmem.c:290! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM ... (kunmap_high+0xb0/0xb8) from [<c0114534>] (__kunmap_atomic+0xa0/0xa4) (__kunmap_atomic+0xa0/0xa4) from [<c035f028>] (blkcipher_walk_done+0x128/0x1ec) (blkcipher_walk_done+0x128/0x1ec) from [<c0366c24>] (crypto_cbc_decrypt+0xc0/0x170) (crypto_cbc_decrypt+0xc0/0x170) from [<c0367148>] (crypto_cts_decrypt+0xc0/0x114) (crypto_cts_decrypt+0xc0/0x114) from [<c035ea98>] (async_decrypt+0x40/0x48) (async_decrypt+0x40/0x48) from [<c032ca34>] (f2fs_fname_disk_to_usr+0x124/0x304) (f2fs_fname_disk_to_usr+0x124/0x304) from [<c03056fc>] (f2fs_fill_dentries+0xac/0x188) (f2fs_fill_dentries+0xac/0x188) from [<c03059c8>] (f2fs_readdir+0x1f0/0x300) (f2fs_readdir+0x1f0/0x300) from [<c0218054>] (vfs_readdir+0x90/0xb4) (vfs_readdir+0x90/0xb4) from [<c0218418>] (SyS_getdents64+0x64/0xcc) (SyS_getdents64+0x64/0xcc) from [<c0105ba0>] (ret_fast_syscall+0x0/0x30) Cc: <stable@vger.kernel.org> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: reorganize f2fs_map_blocksChao Yu
In this patch, we try to reorganize f2fs_map_blocks to make block mapping flow more clear by using following structure: /* check status of mapping */ if (unmapped) { /* blkaddr == NULL_ADDR || blkaddr == NEW_ADDR */ if (create) { /* write path, handle dio write case here */ alloc_and_map; } else { /* * handle read cases from all call paths: * 1. generic read; * 2. dio read; * 3. fiemap; * 4. bmap */ } } /* map buffer_header */ Besides, this patch handles the missing case correctly for dio write: When we fail in __allocate_data_blocks, then in f2fs_map_blocks, we will not allocate blocks correctly for preallocated blocks, but returning with an unmapped buffer head, which will result in failure of dio write. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: declare f2fs_update_extent_tree_range as staticJaegeuk Kim
This function should be static. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix overflow of size calculationChao Yu
We have potential overflow issue when calculating size of object, when we left shift index with PAGE_CACHE_SHIFT bits, if type of index has only 32-bits space in 32-bit architecture, left shifting will incur overflow, i.e: pgoff_t index = 0xFFFFFFFF; loff_t size = index << PAGE_CACHE_SHIFT; size: 0xFFFFF000 So we should cast index with 64-bits type to avoid this issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix incorrect searching position when shrinking extent cacheChao Yu
When shrinking extent cache, we have two steps in the flow: 1) shrink objects which are unreferenced by inodes; 2) shrink objects from LRU list of extent cache. In step 1, if we haven't shrunk enough number of objects, we will try step 2, but before that we didn't update the searching position which may point to last inode index in global extent tree, result in failing to shrink objects by traversing the all inodes' extent tree. In this patch, we reset searching position to beginning of global extent tree for fixing. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: verify file type early in f2fs_fallocateChao Yu
This patch changes to verify file type early in f2fs_fallocate for cleanup, meanwhile this also fixes to add missing verification for expand_inode_data. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: no need to lock for update_inode_page all the timeJaegeuk Kim
As comment says, we don't need to call f2fs_lock_op in write_inode to prevent from producing dirty node pages all the time. That happens only when there is not enough free sections and we can avoid that by calling balance_fs in prior to that. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: cover number of dirty node pages under node_write lockJaegeuk Kim
This number is referenced by checkpoint under node_write lock. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: fix incorrect return statement in the function ↵Nicholas Krause
f2fs_ioc_release_volatile_write This fixes the incorrect return statement at the end of the function f2fs_ioc_release_volatile_write's body for returning zero as this is incorrect due to the function call before this return statement to the function punch_hole being able to fail and we should return this function's return fail directly in order to signal to callers of the function f2fs_ioc_release_volatile if a failure arises with this call to punch_hole fails. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09f2fs: trace in batches extent info updateChao Yu
Rename trace_f2fs_update_extent_tree to trace_f2fs_update_extent_tree_range, then expand and enable it to trace in batches extent info updates. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-07Merge tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Bugfixes: - Fix a use-after-free bug in the RPC/RDMA client - Fix a write performance regression - Fix up page writeback accounting - Don't try to reclaim unused state owners - Fix a NFSv4 nograce recovery hang - reset states to use open_stateid when returning delegation voluntarily - Fix a tracepoint NULL-pointer dereference" * tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a tracepoint NULL-pointer dereference nfs4: reset states to use open_stateid when returning delegation voluntarily NFSv4: Fix a nograce recovery hang NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH NFSv4: Don't try to reclaim unused state owners NFS: Fix a write performance regression NFS: Fix up page writeback accounting xprtrdma: disconnect and flush cqs before freeing buffers
2015-10-06NFS: Fix a tracepoint NULL-pointer dereferenceAnna Schumaker
Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file enabled produces a NULL-pointer dereference when calculating fileid and filehandle of the opened file. Fix this by checking if state is NULL before trying to use the inode pointer. Reported-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-06Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "Two fixes for problems pointed out by automated tools. Thanks PaX/grsecurity team and Dan Carpenter (and the Smatch tool)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: [CIFS] Update cifs version number [SMB3] Do not fall back to SMBWriteX in set_file_size error cases [SMB3] Missing null tcon check
2015-10-03[CIFS] Update cifs version numberSteve French
Update modinfo cifs.ko version number to 2.08 Signed-off-by: Steve French <steve.french@primarydata.com>
2015-10-02nfs4: reset states to use open_stateid when returning delegation voluntarilyJeff Layton
When the client goes to return a delegation, it should always update any nfs4_state currently set up to use that delegation stateid to instead use the open stateid. It already does do this in some cases, particularly in the state recovery code, but not currently when the delegation is voluntarily returned (e.g. in advance of a RENAME). This causes the client to try to continue using the delegation stateid after the DELEGRETURN, e.g. in LAYOUTGET. Set the nfs4_state back to using the open stateid in nfs4_open_delegation_recall, just before clearing the NFS_DELEGATED_STATE bit. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFSv4: Fix a nograce recovery hangBenjamin Coddington
Since commit 5cae02f42793130e1387f4ec09c4d07056ce9fa5 an OPEN_CONFIRM should have a privileged sequence in the recovery case to allow nograce recovery to proceed for NFSv4.0. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FHTrond Myklebust
We need to warn against broken NFSv4.1 servers that try to hand out delegations in response to NFS4_OPEN_CLAIM_DELEG_CUR_FH. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFSv4: Don't try to reclaim unused state ownersTrond Myklebust
Currently, we don't test if the state owner is in use before we try to recover it. The problem is that if the refcount is zero, then the state owner will be waiting on the lru list for garbage collection. The expectation in that case is that if you bump the refcount, then you must also remove the state owner from the lru list. Otherwise the call to nfs4_put_state_owner will corrupt that list by trying to add our state owner a second time. Avoid the whole problem by just skipping state owners that hold no state. Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFS: Fix a write performance regressionTrond Myklebust
If all other conditions in nfs_can_extend_write() are met, and there are no locks, then we should be able to assume close-to-open semantics and the ability to extend our write to cover the whole page. With this patch, the xfstests generic/074 test completes in 242s instead of >1400s on my test rig. Fixes: bd61e0a9c852 ("locks: convert posix locks to file_lock_context") Cc: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFS: Fix up page writeback accountingTrond Myklebust
Currently, we are crediting all the calls to nfs_writepages_callback() (i.e. the nfs_writepages() callback) to nfs_writepage(). Aside from being inconsistent with the behaviour of the equivalent readpage/readpages accounting, this also means that we cannot distinguish between bulk writes and single page writebacks (which confuses the 'nfsiostat -p' tool). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02[SMB3] Do not fall back to SMBWriteX in set_file_size error casesSteve French
The error paths in set_file_size for cifs and smb3 are incorrect. In the unlikely event that a server did not support set file info of the file size, the code incorrectly falls back to trying SMBWriteX (note that only the original core SMB Write, used for example by DOS, can set the file size this way - this actually does not work for the more recent SMBWriteX). The idea was since the old DOS SMB Write could set the file size if you write zero bytes at that offset then use that if server rejects the normal set file info call. Fortunately the SMBWriteX will never be sent on the wire (except when file size is zero) since the length and offset fields were reversed in the two places in this function that call SMBWriteX causing the fall back path to return an error. It is also important to never call an SMB request from an SMB2/sMB3 session (which theoretically would be possible, and can cause a brief session drop, although the client recovers) so this should be fixed. In practice this path does not happen with modern servers but the error fall back to SMBWriteX is clearly wrong. Removing the calls to SMBWriteX in the error paths in cifs_set_file_size Pointed out by PaX/grsecurity team Signed-off-by: Steve French <steve.french@primarydata.com> Reported-by: PaX Team <pageexec@freemail.hu> CC: Emese Revfy <re.emese@gmail.com> CC: Brad Spengler <spender@grsecurity.net> CC: Stable <stable@vger.kernel.org>
2015-10-02dax: fix NULL pointer in __dax_pmd_fault()Ross Zwisler
Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") moved some code in __dax_pmd_fault() that was responsible for zeroing newly allocated PMD pages. The new location didn't properly set up 'kaddr', so when run this code resulted in a NULL pointer BUG. Fix this by getting the correct 'kaddr' via bdev_direct_access(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-01Merge tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBI/UBIFS fixes from Richard Weinberger: "This contains three bug fixes for both UBI and UBIFS" * tag 'upstream-4.3-rc4' of git://git.infradead.org/linux-ubifs: UBI: return ENOSPC if no enough space available UBI: Validate data_size UBIFS: Kill unneeded locking in ubifs_init_security
2015-09-29UBIFS: Kill unneeded locking in ubifs_init_securityRichard Weinberger
Fixes the following lockdep splat: [ 1.244527] ============================================= [ 1.245193] [ INFO: possible recursive locking detected ] [ 1.245193] 4.2.0-rc1+ #37 Not tainted [ 1.245193] --------------------------------------------- [ 1.245193] cp/742 is trying to acquire lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0 [ 1.245193] [ 1.245193] but task is already holding lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] other info that might help us debug this: [ 1.245193] Possible unsafe locking scenario: [ 1.245193] [ 1.245193] CPU0 [ 1.245193] ---- [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] [ 1.245193] *** DEADLOCK *** [ 1.245193] [ 1.245193] May be due to missing lock nesting notation [ 1.245193] [ 1.245193] 2 locks held by cp/742: [ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff811ad37f>] mnt_want_write+0x1f/0x50 [ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] stack backtrace: [ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37 [ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014 [ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5 [ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68 [ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500 [ 1.245193] Call Trace: [ 1.245193] [<ffffffff814f6f49>] dump_stack+0x4c/0x65 [ 1.245193] [<ffffffff810b56c5>] ? console_unlock+0x1c5/0x510 [ 1.245193] [<ffffffff810a150d>] __lock_acquire+0x1a6d/0x1ea0 [ 1.245193] [<ffffffff8109fa78>] ? __lock_is_held+0x58/0x80 [ 1.245193] [<ffffffff810a1a93>] lock_acquire+0xd3/0x270 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff814fc83b>] mutex_lock_nested+0x6b/0x3a0 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff8128e286>] ubifs_create+0xa6/0x1f0 [ 1.245193] [<ffffffff81198e7f>] ? path_openat+0x3af/0x1280 [ 1.245193] [<ffffffff81195d15>] vfs_create+0x95/0xc0 [ 1.245193] [<ffffffff8119929c>] path_openat+0x7cc/0x1280 [ 1.245193] [<ffffffff8109ffe3>] ? __lock_acquire+0x543/0x1ea0 [ 1.245193] [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [<ffffffff81088c00>] ? calc_global_load_tick+0x60/0x90 [ 1.245193] [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180 [ 1.245193] [<ffffffff8119ac55>] do_filp_open+0x75/0xd0 [ 1.245193] [<ffffffff814ffd86>] ? _raw_spin_unlock+0x26/0x40 [ 1.245193] [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180 [ 1.245193] [<ffffffff81189bd9>] do_sys_open+0x129/0x200 [ 1.245193] [<ffffffff81189cc9>] SyS_open+0x19/0x20 [ 1.245193] [<ffffffff81500717>] entry_SYSCALL_64_fastpath+0x12/0x6f While the lockdep splat is a false positive, becuase path_openat holds i_mutex of the parent directory and ubifs_init_security() tries to acquire i_mutex of a new inode, it reveals that taking i_mutex in ubifs_init_security() is in vain because it is only being called in the inode allocation path and therefore nobody else can see the inode yet. Cc: stable@vger.kernel.org # 3.20- Reported-and-tested-by: Boris Brezillon <boris.brezillon@free-electrons.com> Reviewed-and-tested-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: dedekind1@gmail.com
2015-09-27Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "Four fixes from testing at the recent SMB3 Plugfest including two important authentication ones (one fixes authentication problems to some popular servers when clock times differ more than two hours between systems, the other fixes Kerberos authentication for SMB3)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: fix encryption error checks on mount [SMB3] Fix sec=krb5 on smb3 mounts cifs: use server timestamp for ntlmv2 authentication disabling oplocks/leases via module parm enable_oplocks broken for SMB3
2015-09-26[SMB3] Missing null tcon checkSteve French
Pointed out by Dan Carpenter via smatch code analysis tool CC: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <steve.french@primarydata.com>
2015-09-25Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-25Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - fix v4.2 SEEK on files over 2 gigs - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O. - Fix recovery of recalled read delegations Bugfixes: - Fix a case where NFSv4 fails to send CLOSE after a server reboot - Fix sunrpc to wait for connections to complete before retrying - Fix sunrpc races between transport connect/disconnect and shutdown - Fix an infinite loop when layoutget fail with BAD_STATEID - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set - Fix layoutreturn/close ordering issues" * tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS41: make close wait for layoutreturn NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget NFSv4: Recovery of recalled read delegations is broken NFS: Fix an infinite loop when layoutget fail with BAD_STATEID NFS: Do cleanup before resetting pageio read/write to mds SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose SUNRPC: Lock the transport layer on shutdown nfs/filelayout: Fix NULL reference caused by double freeing of fh_array SUNRPC: Ensure that we wait for connections to complete before retrying SUNRPC: drop null test before destroy functions nfs: fix v4.2 SEEK on files over 2 gigs SUNRPC: Fix races between socket connection and destroy code nfs: fix pg_test page count calculation Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-24fix encryption error checks on mountSteve French
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-09-24[SMB3] Fix sec=krb5 on smb3 mountsSteve French
Kerberos, which is very important for security, was only enabled for CIFS not SMB2/SMB3 mounts (e.g. vers=3.0) Patch based on the information detailed in http://thread.gmane.org/gmane.linux.kernel.cifs/10081/focus=10307 to enable Kerberized SMB2/SMB3 a) SMB2_negotiate: enable/use decode_negTokenInit in SMB2_negotiate b) SMB2_sess_setup: handle Kerberos sectype and replicate Kerberos SMB1 processing done in sess_auth_kerberos Signed-off-by: Noel Power <noel.power@suse.com> Signed-off-by: Jim McDonough <jmcd@samba.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <steve.french@primarydata.com>
2015-09-23NFS41: make close wait for layoutreturnPeng Tao
If we send a layoutreturn asynchronously before close, the close might reach server first and layoutreturn would fail with BADSTATEID because there is nothing keeping the layout stateid alive. Also do not pretend sending layoutreturn if we are not. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-22ocfs2/dlm: fix deadlock when dispatch assert masterJoseph Qi
The order of the following three spinlocks should be: dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock But dlm_dispatch_assert_master() is called while holding dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls dlm_grab() which will take dlm_domain_lock. Once another thread (for example, dlm_query_join_handler) has already taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock happens. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: "Junxiao Bi" <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to ↵Andrea Arcangeli
__wake_up_locked_key" This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts fs/userfaultfd.c to use the old version of that function. It didn't look robust to call __wake_up_common with "nr == 1" when we absolutely require wakeall semantics, but we've full control of what we insert in the two waitqueue heads of the blocked userfaults. No exclusive waitqueue risks to be inserted into those two waitqueue heads so we can as well stick to "nr == 1" of the old code and we can rely purely on the fact no waitqueue inserted in one of the two waitqueue heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Thierry Reding <treding@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is setKinglong Mee
When lseg's commit_through_mds is set, pnfs client always WARN once in nfs_direct_select_verf after checking ds_cinfo.nbuckets. nfs should use the DS verf except commit_through_mds is set for layout segment where nbuckets is zero. [17844.666094] ------------[ cut here ]------------ [17844.667071] WARNING: CPU: 0 PID: 21758 at /root/source/linux-pnfs/fs/nfs/direct.c:174 nfs_direct_select_verf+0x5a/0x70 [nfs]() [17844.668650] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c btrfs ppdev coretemp crct10dif_pclmul auth_rpcgss crc32_pclmul crc32c_intel nfs_acl ghash_clmulni_intel lockd vmw_balloon xor vmw_vmci grace raid6_pq shpchp sunrpc parport_pc i2c_piix4 parport vmwgfx drm_kms_helper ttm drm serio_raw mptspi e1000 scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache] [17844.686676] CPU: 0 PID: 21758 Comm: kworker/0:1 Tainted: G W OE 4.3.0-rc1-pnfs+ #245 [17844.687352] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [17844.698502] Workqueue: nfsiod rpc_async_release [sunrpc] [17844.699212] 0000000000000009 0000000043e58010 ffff8800454fbc10 ffffffff813680c4 [17844.699990] ffff8800454fbc48 ffffffff8108b49d ffff88004eb20000 ffff88004eb20000 [17844.700844] ffff880062e26000 0000000000000000 0000000000000001 ffff8800454fbc58 [17844.701637] Call Trace: [17844.725252] [<ffffffff813680c4>] dump_stack+0x19/0x25 [17844.732693] [<ffffffff8108b49d>] warn_slowpath_common+0x7d/0xb0 [17844.733855] [<ffffffff8108b5da>] warn_slowpath_null+0x1a/0x20 [17844.735015] [<ffffffffa04a27ca>] nfs_direct_select_verf+0x5a/0x70 [nfs] [17844.735999] [<ffffffffa04a2b83>] nfs_direct_set_hdr_verf+0x23/0x90 [nfs] [17844.736846] [<ffffffffa04a2e17>] nfs_direct_write_completion+0x227/0x260 [nfs] [17844.737782] [<ffffffffa04a433c>] nfs_pgio_release+0x1c/0x20 [nfs] [17844.738597] [<ffffffffa0502df3>] pnfs_generic_rw_release+0x23/0x30 [nfsv4] [17844.739486] [<ffffffffa01cbbea>] rpc_free_task+0x2a/0x70 [sunrpc] [17844.740326] [<ffffffffa01cbcd5>] rpc_async_release+0x15/0x20 [sunrpc] [17844.741173] [<ffffffff810a387c>] process_one_work+0x21c/0x4c0 [17844.741984] [<ffffffff810a37cd>] ? process_one_work+0x16d/0x4c0 [17844.742837] [<ffffffff810a3b6a>] worker_thread+0x4a/0x440 [17844.743639] [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0 [17844.744399] [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0 [17844.745176] [<ffffffff810a8d75>] kthread+0xf5/0x110 [17844.745927] [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240 [17844.747105] [<ffffffff8172ce1f>] ret_from_fork+0x3f/0x70 [17844.747856] [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240 [17844.748642] ---[ end trace 336a2845d42b83f0 ]--- Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-22cifs: use server timestamp for ntlmv2 authenticationPeter Seiderer
Linux cifs mount with ntlmssp against an Mac OS X (Yosemite 10.10.5) share fails in case the clocks differ more than +/-2h: digest-service: digest-request: od failed with 2 proto=ntlmv2 digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2 Fix this by (re-)using the given server timestamp for the ntlmv2 authentication (as Windows 7 does). A related problem was also reported earlier by Namjae Jaen (see below): Windows machine has extended security feature which refuse to allow authentication when there is time difference between server time and client time when ntlmv2 negotiation is used. This problem is prevalent in embedded enviornment where system time is set to default 1970. Modern servers send the server timestamp in the TargetInfo Av_Pair structure in the challenge message [see MS-NLMP 2.2.2.1] In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must use the server provided timestamp if present OR current time if it is not Reported-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Peter Seiderer <ps.report@gmx.net> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>
2015-09-22disabling oplocks/leases via module parm enable_oplocks broken for SMB3Steve French
leases (oplocks) were always requested for SMB2/SMB3 even when oplocks disabled in the cifs.ko module. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Chandrika Srinivasan <chandrika.srinivasan@citrix.com> CC: Stable <stable@vger.kernel.org>
2015-09-22Btrfs: keep dropped roots in cache until transaction commitJosef Bacik
When dropping a snapshot we need to account for the qgroup changes. If we drop the snapshot in all one go then the backref code will fail to find blocks from the snapshot we dropped since it won't be able to find the root in the fs root cache. This can lead to us failing to find refs from other roots that pointed at blocks in the now deleted root. To handle this we need to not remove the fs roots from the cache until after we process the qgroup operations. Do this by adding dropped roots to a list on the transaction, and letting the transaction remove the roots at the same time it drops the commit roots. This will keep all of the backref searching code in sync properly, and fixes a problem Mark was seeing with snapshot delete and qgroups. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-09-21Btrfs: Direct I/O: Fix space accountingchandan
The following call trace is seen when generic/095 test is executed, WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0() Modules linked in: CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0 Call Trace: [<ffffffff81984058>] dump_stack+0x45/0x57 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0 [<ffffffff8117ce07>] destroy_inode+0x37/0x60 [<ffffffff8117cf39>] evict+0x109/0x170 [<ffffffff8117cfd5>] dispose_list+0x35/0x50 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0 [<ffffffff81165951>] kill_anon_super+0x11/0x20 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70 [<ffffffff811660cf>] deactivate_super+0x5f/0x70 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10 [<ffffffff81069c06>] task_work_run+0x96/0xb0 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50 [<ffffffff8198cbc2>] int_signal+0x12/0x17 This means that the inode had non-zero "outstanding extents" during eviction. This occurs because, during direct I/O a task which successfully used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does not clear the bit after finishing the DIO write. A future DIO write could actually fail and the unused reserve space won't be freed because of the previously set BTRFS_INODE_DIO_READY bit. Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the following issue, |-----------------------------------+-------------------------------------| | Task A | Task B | |-----------------------------------+-------------------------------------| | Start direct i/o write on inode X.| | | reserve space | | | Allocate ordered extent | | | release reserved space | | | Set BTRFS_INODE_DIO_READY bit. | | | | splice() | | | Transfer data from pipe buffer to | | | destination file. | | | - kmap(pipe buffer page) | | | - Start direct i/o write on | | | inode X. | | | - reserve space | | | - dio_refill_pages() | | | - sdio->blocks_available == 0 | | | - Since a kernel address is | | | being passed instead of a | | | user space address, | | | iov_iter_get_pages() returns | | | -EFAULT. | | | - Since BTRFS_INODE_DIO_READY is | | | set, we don't release reserved | | | space. | | | - Clear BTRFS_INODE_DIO_READY bit.| | -EIOCBQUEUED is returned. | | |-----------------------------------+-------------------------------------| Hence this commit introduces "struct btrfs_dio_data" to track the usage of reserved data space. The remaining unused "reserve space" can now be freed reliably. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>