summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2015-01-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Revert a potential seek_data/hole regression which shows up when using ext4 to handle ext3 file systems, plus two minor bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove spurious KERN_INFO from ext4_warning call Revert "ext4: fix suboptimal seek_{data,hole} extents traversial" ext4: prevent online resize with backup superblock
2015-01-02ext4: remove spurious KERN_INFO from ext4_warning callJakub Wilk
Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"Theodore Ts'o
This reverts commit 14516bb7bb6ffbd49f35389f9ece3b2045ba5815. This was causing regression test failures with generic/285 with an ext3 filesystem using CONFIG_EXT4_USE_FOR_EXT23. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-27ext4: prevent online resize with backup superblockTheodore Ts'o
Prevent BUG or corrupted file systems after the following: mkfs.ext4 /dev/vdc 100M mount -t ext4 -o sb=40961 /dev/vdc /vdc resize2fs /dev/vdc We previously prevented online resizing using the old resize ioctl. Move the code to ext4_resize_begin(), so the check applies for all of the resize ioctl's. Reported-by: Maxim Malkov <malkov@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-17move_extent_per_page(): get rid of unused w_flagsAl Viro
... and comparing get_fs() with KERNEL_DS used only to initialize that Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-12Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of bugs fixes, including Zheng and Jan's extent status shrinker fixes, which should improve CPU utilization and potential soft lockups under heavy memory pressure, and Eric Whitney's bigalloc fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits) ext4: ext4_da_convert_inline_data_to_extent drop locked page after error ext4: fix suboptimal seek_{data,hole} extents traversial ext4: ext4_inline_data_fiemap should respect callers argument ext4: prevent fsreentrance deadlock for inline_data ext4: forbid journal_async_commit in data=ordered mode jbd2: remove unnecessary NULL check before iput() ext4: Remove an unnecessary check for NULL before iput() ext4: remove unneeded code in ext4_unlink ext4: don't count external journal blocks as overhead ext4: remove never taken branch from ext4_ext_shift_path_extents() ext4: create nojournal_checksum mount option ext4: update comments regarding ext4_delete_inode() ext4: cleanup GFP flags inside resize path ext4: introduce aging to extent status tree ext4: cleanup flag definitions for extent status tree ext4: limit number of scanned extents in status tree shrinker ext4: move handling of list of shrinkable inodes into extent status code ext4: change LRU to round-robin in extent status tree shrinker ext4: cache extent hole in extent status tree for ext4_da_map_blocks() ext4: fix block reservation for bigalloc filesystems ...
2014-12-06ext4: ext4_da_convert_inline_data_to_extent drop locked page after errorDmitry Monakhov
Testcase: xfstests generic/270 MKFS_OPTIONS="-q -I 256 -O inline_data,64bit" Call Trace: [<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK [<ffffffff81145260>] pagecache_get_page+0x65/0x15a [<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c [<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6 [<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c [<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b [<ffffffff81150af0>] truncate_inode_pages+0x12/0x14 [<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25 [<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c [<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302 [<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf [<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc [<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6 [<ffffffff810c040e>] ? mark_held_locks+0x59/0x77 [<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5 [<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354 [<ffffffff81185656>] ? file_start_write+0x2a/0x2c [<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15 [<ffffffff811858ce>] new_sync_write+0x8a/0xb2 [<ffffffff81186e7b>] vfs_write+0xb5/0x14d [<ffffffff81186ffb>] SyS_write+0x5c/0x8c [<ffffffff816f2529>] system_call_fastpath+0x12/0x17 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02ext4: fix suboptimal seek_{data,hole} extents traversialDmitry Monakhov
It is ridiculous practice to scan inode block by block, this technique applicable only for old indirect files. This takes significant amount of time for really large files. Let's reuse ext4_fiemap which already traverse inode-tree in most optimal meaner. TESTCASE: ftruncate64(fd, 0); ftruncate64(fd, 1ULL << 40); /* lseek will spin very long time */ lseek64(fd, 0, SEEK_DATA); lseek64(fd, 0, SEEK_HOLE); Original report: https://lkml.org/lkml/2014/10/16/620 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02ext4: ext4_inline_data_fiemap should respect callers argumentDmitry Monakhov
Currently ext4_inline_data_fiemap ignores requested arguments (start and len) which may lead endless loop if start != 0. Also fix incorrect extent length determination. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02ext4: prevent fsreentrance deadlock for inline_dataDmitry Monakhov
ext4_da_convert_inline_data_to_extent() invokes grab_cache_page_write_begin(). grab_cache_page_write_begin performs memory allocation, so fs-reentrance should be prohibited because we are inside journal transaction. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-26ext4: forbid journal_async_commit in data=ordered modeJan Kara
Option journal_async_commit breaks gurantees of data=ordered mode as it sends only a single cache flush after writing a transaction commit block. Thus even though the transaction including the commit block is fully stored on persistent storage, file data may still linger in drives caches and will be lost on power failure. Since all checksums match on journal recovery, we replay the transaction thus possibly exposing stale user data. To fix this data exposure issue, remove the possibility to use journal_async_commit in data=ordered mode. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-26ext4: Remove an unnecessary check for NULL before iput()Markus Elfring
The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: remove unneeded code in ext4_unlinkNamjae Jeon
Setting retval to zero is not needed in ext4_unlink. Remove unneeded code. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: don't count external journal blocks as overheadEric Sandeen
This was fixed for ext3 with: e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs but was never fixed for ext4. With a large external journal and no used disk blocks, df comes out negative without this, as journal blocks are added to the overhead & subtracted from used blocks unconditionally. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: remove never taken branch from ext4_ext_shift_path_extents()Jan Kara
path[depth].p_hdr can never be NULL for a path passed to us (and even if it could, EXT_LAST_EXTENT() would make something != NULL from it). So just remove the branch. Coverity-id: 1196498 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: create nojournal_checksum mount optionDarrick J. Wong
Create a mount option to disable journal checksumming (because the metadata_csum feature turns it on by default now), and fix remount not to allow changing the journal checksumming option, since changing the mount options has no effect on the journal. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: update comments regarding ext4_delete_inode()Wang Shilong
ext4_delete_inode() has been renamed for a long time, update comments for this. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: cleanup GFP flags inside resize pathDmitry Monakhov
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo and ext4_calculate_overhead() because they are called from inside a journal transaction. Call trace: ioctl ->ext4_group_add ->journal_start ->ext4_setup_new_descs ->ext4_mb_add_groupinfo -> GFP_KERNEL ->ext4_flex_group_add ->ext4_update_super ->ext4_calculate_overhead -> GFP_KERNEL ->journal_stop Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: introduce aging to extent status treeJan Kara
Introduce a simple aging to extent status tree. Each extent has a REFERENCED bit which gets set when the extent is used. Shrinker then skips entries with referenced bit set and clears the bit. Thus frequently used extents have higher chances of staying in memory. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: cleanup flag definitions for extent status treeJan Kara
Currently flags for extent status tree are defined twice, once shifted and once without a being shifted. Consolidate these definitions into one place and make some computations automatic to make adding flags less error prone. Compiler should be clever enough to figure out these are constants and generate the same code. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: limit number of scanned extents in status tree shrinkerJan Kara
Currently we scan extent status trees of inodes until we reclaim nr_to_scan extents. This can however require a lot of scanning when there are lots of delayed extents (as those cannot be reclaimed). Change shrinker to work as shrinkers are supposed to and *scan* only nr_to_scan extents regardless of how many extents did we actually reclaim. We however need to be careful and avoid scanning each status tree from the beginning - that could lead to a situation where we would not be able to reclaim anything at all when first nr_to_scan extents in the tree are always unreclaimable. We remember with each inode offset where we stopped scanning and continue from there when we next come across the inode. Note that we also need to update places calling __es_shrink() manually to pass reasonable nr_to_scan to have a chance of reclaiming anything and not just 1. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: move handling of list of shrinkable inodes into extent status codeJan Kara
Currently callers adding extents to extent status tree were responsible for adding the inode to the list of inodes with freeable extents. This is error prone and puts list handling in unnecessarily many places. Just add inode to the list automatically when the first non-delay extent is added to the tree and remove inode from the list when the last non-delay extent is removed. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: change LRU to round-robin in extent status tree shrinkerZheng Liu
In this commit we discard the lru algorithm for inodes with extent status tree because it takes significant effort to maintain a lru list in extent status tree shrinker and the shrinker can take a long time to scan this lru list in order to reclaim some objects. We replace the lru ordering with a simple round-robin. After that we never need to keep a lru list. That means that the list needn't be sorted if the shrinker can not reclaim any objects in the first round. Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: cache extent hole in extent status tree for ext4_da_map_blocks()Zheng Liu
Currently extent status tree doesn't cache extent hole when a write looks up in extent tree to make sure whether a block has been allocated or not. In this case, we don't put extent hole in extent cache because later this extent might be removed and a new delayed extent might be added back. But it will cause a defect when we do a lot of writes. If we don't put extent hole in extent cache, the following writes also need to access extent tree to look at whether or not a block has been allocated. It brings a cache miss. This commit fixes this defect. Also if the inode doesn't have any extent, this extent hole will be cached as well. Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25ext4: fix block reservation for bigalloc filesystemsJan Kara
For bigalloc filesystems we have to check whether newly requested inode block isn't already part of a cluster for which we already have delayed allocation reservation. This check happens in ext4_ext_map_blocks() and that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if ext4_da_map_blocks() finds in extent cache information about the block, we don't call into ext4_ext_map_blocks() and thus we always end up getting new reservation even if the space for cluster is already reserved. This results in overreservation and premature ENOSPC reports. Fix the problem by checking for existing cluster reservation already in ext4_da_map_blocks(). That simplifies the logic and actually allows us to get rid of the EXT4_MAP_FROM_CLUSTER flag completely. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23ext4: fix end of region partial cluster handlingEric Whitney
ext4_ext_remove_space() can incorrectly free a partial_cluster if EAGAIN is encountered while truncating or punching. Extent removal should be retried in this case. It also fails to free a partial cluster when the punched region begins at the start of a file on that unaligned cluster and where the entire file has not been punched. Remove the requirement that all blocks in the file must have been freed in order to free the partial cluster. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23ext4: miscellaneous partial cluster cleanupsEric Whitney
Add some casts and rearrange a few statements for improved readability. Some code can also be simplified and made more readable if we set partial_cluster to 0 rather than to a negative value when we can tell we've hit the left edge of the punched region. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23ext4: fix end of leaf partial cluster handlingEric Whitney
The fix in commit ad6599ab3ac9 ("ext4: fix premature freeing of partial clusters split across leaf blocks"), intended to avoid dereferencing an invalid extent pointer when determining whether a partial cluster should be freed, wasn't quite good enough. Assure that at least one extent remains at the start of the leaf once the hole has been punched. Otherwise, the pointer to the extent to the right of the hole will be invalid and a partial cluster will be incorrectly freed. Set partial_cluster to 0 when we can tell we've hit the left edge of the punched region within the leaf. This prevents incorrect freeing of a partial cluster when ext4_ext_rm_leaf is called one last time during extent tree traversal after the punched region has been removed. Adjust comments to reflect code changes and a correction. Remove a bit of dead code. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-23ext4: fix partial cluster initializationEric Whitney
The partial_cluster variable is not always initialized correctly when hole punching on bigalloc file systems. Although commit c06344939422 ("ext4: fix partial cluster handling for bigalloc file systems") addressed the case where the right edge of the punched region and the next extent to its right were within the same leaf, it didn't handle the case where the next extent to its right is in the next leaf. This causes xfstest generic/300 to fail. Fix this by replacing the code in c0634493922 with a more general solution that can continue the search for the first cluster to the right of the punched region into the next leaf if present. If found, partial_cluster is initialized to this cluster's negative value. There's no need to determine if that cluster is actually shared; we simply record it so its blocks won't be freed in the event it does happen to be shared. Also, minimize the burden on non-bigalloc file systems with some minor code simplification. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20ext4: kill ext4_kvfree()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-10ext4: Convert to private i_dquot fieldJan Kara
CC: linux-ext4@vger.kernel.org Acked-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-05ext4: move_extent improve bh vanishing success factorDmitry Monakhov
Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302 Unfortunetly there is nothing we can do if some other task holds BH's refenrence. So we must return EBUSY in this case. But we can try kicking the journal to see if the other task releases the bh reference after the commit is complete. Also decrease false positives by properly checking for ENOSPC and retrying the allocation after kicking the journal --- which is done by ext4_should_retry_alloc(). [ Modified by tytso to properly check for ENOSPC. ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: make ext4_ext_convert_to_initialized() return proper number of blocksJan Kara
ext4_ext_convert_to_initialized() can return more blocks than are actually allocated from map->m_lblk in case where initial part of the on-disk extent is zeroed out. Luckily this doesn't have serious consequences because the caller currently uses the return value only to unmap metadata buffers. Anyway this is a data corruption/exposure problem waiting to happen so fix it. Coverity-id: 1226848 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: bail early when clearing inode journal flag failsJan Kara
When clearing inode journal flag, we call jbd2_journal_flush() to force all the journalled data to their final locations. Currently we ignore when this fails and continue clearing inode journal flag. This isn't a big problem because when jbd2_journal_flush() fails, journal is likely aborted anyway. But it can still lead to somewhat confusing results so rather bail out early. Coverity-id: 989044 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: bail out from make_indexed_dir() on first errorJan Kara
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node() fail, there's really something wrong with the fs and there's no point in continuing further. Just return error from make_indexed_dir() in that case. Also initialize frames array so that if we return early due to error, dx_release() doesn't try to dereference uninitialized memory (which could happen also due to error in do_split()). Coverity-id: 741300 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: prevent bugon on race between write/fcntlDmitry Monakhov
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2014-10-30ext4: remove extent status procfs files if journal load failsDarrick J. Wong
If we can't load the journal, remove the procfs files for the extent status information file to avoid leaking resources. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: disallow changing journal_csum option during remountDarrick J. Wong
ext4 does not permit changing the metadata or journal checksum feature flag while mounted. Until we decide to support that, don't allow a remount to change the journal_csum flag (right now we silently fail to change anything). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: enable journal checksum when metadata checksum feature enabledDarrick J. Wong
If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-30ext4: fix oops when loading block bitmap failedJan Kara
When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30ext4: fix overflow when updating superblock backups after resizeJan Kara
When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-10-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "overlayfs merge + leak fix for d_splice_alias() failure exits" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: overlayfs: embed middle into overlay_readdir_data overlayfs: embed root into overlay_readdir_data overlayfs: make ovl_cache_entry->name an array instead of pointer overlayfs: don't hold ->i_mutex over opening the real directory fix inode leaks on d_splice_alias() failure exits fs: limit filesystem stacking depth overlay: overlay filesystem documentation overlayfs: implement show_options overlayfs: add statfs support overlay filesystem shmem: support RENAME_WHITEOUT ext4: support RENAME_WHITEOUT vfs: add RENAME_WHITEOUT vfs: add whiteout support vfs: export check_sticky() vfs: introduce clone_private_mount() vfs: export __inode_permission() to modules vfs: export do_splice_direct() to modules vfs: add i_op->dentry_open()
2014-10-23ext4: support RENAME_WHITEOUTMiklos Szeredi
Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is created before the rename takes place. The whiteout inode is added to the old entry instead of deleting it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-20Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with some (minor) journal optimizations" [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ] * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits) ext4: check s_chksum_driver when looking for bg csum presence ext4: move error report out of atomic context in ext4_init_block_bitmap() ext4: Replace open coded mdata csum feature to helper function ext4: delete useless comments about ext4_move_extents ext4: fix reservation overflow in ext4_da_write_begin ext4: add ext4_iget_normal() which is to be used for dir tree lookups ext4: don't orphan or truncate the boot loader inode ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT ext4: optimize block allocation on grow indepth ext4: get rid of code duplication ext4: fix over-defensive complaint after journal abort ext4: fix return value of ext4_do_update_inode ext4: fix mmap data corruption when blocksize < pagesize vfs: fix data corruption when blocksize < pagesize for mmaped data ext4: fold ext4_nojournal_sops into ext4_sops ext4: support freezing ext2 (nojournal) file systems ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs() ext4: don't check quota format when there are no quota files jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list jbd2: avoid pointless scanning of checkpoint lists ...
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-10-14ext4: check s_chksum_driver when looking for bg csum presenceDarrick J. Wong
Convert the ext4_has_group_desc_csum predicate to look for a checksum driver instead of the metadata_csum flag and change the bg checksum calculation function to look for GDT_CSUM before taking the crc16 path. Without this patch, if we mount with ^uninit_bg,^metadata_csum and later metadata_csum gets turned on by accident, the block group checksum functions will incorrectly assume that checksumming is enabled (metadata_csum) but that crc16 should be used (!s_chksum_driver). This is totally wrong, so fix the predicate and the checksum formula selection. (Granted, if the metadata_csum feature bit gets enabled on a live FS then something underhanded is going on, but we could at least avoid writing garbage into the on-disk fields.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2014-10-13ext4: move error report out of atomic context in ext4_init_block_bitmap()Dmitry Monakhov
Error report likely result in IO so it is bad idea to do it from atomic context. This patch should fix following issue: BUG: sleeping function called from invalid context at include/linux/buffer_head.h:349 in_atomic(): 1, irqs_disabled(): 0, pid: 137, name: kworker/u128:1 5 locks held by kworker/u128:1/137: #0: ("writeback"){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0 #1: ((&(&wb->dwork)->work)){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0 #2: (jbd2_handle){......}, at: [<ffffffff81242622>] start_this_handle+0x712/0x7b0 #3: (&ei->i_data_sem){......}, at: [<ffffffff811fa387>] ext4_map_blocks+0x297/0x430 #4: (&(&bgl->locks[i].lock)->rlock){......}, at: [<ffffffff811f3180>] ext4_read_block_bitmap_nowait+0x5d0/0x630 CPU: 3 PID: 137 Comm: kworker/u128:1 Not tainted 3.17.0-rc2-00184-g82752e4 #165 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 Workqueue: writeback bdi_writeback_workfn (flush-1:0) 0000000000000411 ffff880813777288 ffffffff815c7fdc ffff880813777288 ffff880813a8bba0 ffff8808137772a8 ffffffff8108fb30 ffff880803e01e38 ffff880803e01e38 ffff8808137772c8 ffffffff811a8d53 ffff88080ecc6000 Call Trace: [<ffffffff815c7fdc>] dump_stack+0x51/0x6d [<ffffffff8108fb30>] __might_sleep+0xf0/0x100 [<ffffffff811a8d53>] __sync_dirty_buffer+0x43/0xe0 [<ffffffff811a8e03>] sync_dirty_buffer+0x13/0x20 [<ffffffff8120f581>] ext4_commit_super+0x1d1/0x230 [<ffffffff8120fa03>] save_error_info+0x23/0x30 [<ffffffff8120fd06>] __ext4_error+0xb6/0xd0 [<ffffffff8120f260>] ? ext4_group_desc_csum+0x140/0x190 [<ffffffff811f2d8c>] ext4_read_block_bitmap_nowait+0x1dc/0x630 [<ffffffff8122e23a>] ext4_mb_init_cache+0x21a/0x8f0 [<ffffffff8113ae95>] ? lru_cache_add+0x55/0x60 [<ffffffff8112e16c>] ? add_to_page_cache_lru+0x6c/0x80 [<ffffffff8122eaa0>] ext4_mb_init_group+0x190/0x280 [<ffffffff8122ec51>] ext4_mb_good_group+0xc1/0x190 [<ffffffff8123309a>] ext4_mb_regular_allocator+0x17a/0x410 [<ffffffff8122c821>] ? ext4_mb_use_preallocated+0x31/0x380 [<ffffffff81233535>] ? ext4_mb_new_blocks+0x205/0x8e0 [<ffffffff8116ed5c>] ? kmem_cache_alloc+0xfc/0x180 [<ffffffff812335b0>] ext4_mb_new_blocks+0x280/0x8e0 [<ffffffff8116f2c4>] ? __kmalloc+0x144/0x1c0 [<ffffffff81221797>] ? ext4_find_extent+0x97/0x320 [<ffffffff812257f4>] ext4_ext_map_blocks+0xbc4/0x1050 [<ffffffff811fa387>] ? ext4_map_blocks+0x297/0x430 [<ffffffff811fa3ab>] ext4_map_blocks+0x2bb/0x430 [<ffffffff81200e43>] ? ext4_init_io_end+0x23/0x50 [<ffffffff811feb44>] ext4_writepages+0x564/0xaf0 [<ffffffff815cde3b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffff810ac7bd>] ? lock_release_non_nested+0x2fd/0x3c0 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490 [<ffffffff811377e3>] do_writepages+0x23/0x40 [<ffffffff8119c8ce>] __writeback_single_inode+0x9e/0x280 [<ffffffff811a026b>] writeback_sb_inodes+0x2db/0x490 [<ffffffff811a0664>] wb_writeback+0x174/0x2d0 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811a0863>] wb_do_writeback+0xa3/0x200 [<ffffffff811a0a40>] bdi_writeback_workfn+0x80/0x230 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0 [<ffffffff810856cd>] process_one_work+0x2dd/0x4d0 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0 [<ffffffff81085c1d>] worker_thread+0x35d/0x460 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0 [<ffffffff8108a885>] kthread+0xf5/0x100 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff8108a790>] ? __init_kthread_worker+0x70/0x70 [<ffffffff815ce2ac>] ret_from_fork+0x7c/0xb0 [<ffffffff8108a790>] ? __init_kthread_work Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-13ext4: Replace open coded mdata csum feature to helper functionDmitry Monakhov
Besides the fact that this replacement improves code readability it also protects from errors caused direct EXT4_S(sb)->s_es manipulation which may result attempt to use uninitialized csum machinery. #Testcase_BEGIN IMG=/dev/ram0 MNT=/mnt mkfs.ext4 $IMG mount $IMG $MNT #Enable feature directly on disk, on mounted fs tune2fs -O metadata_csum $IMG # Provoke metadata update, likey result in OOPS touch $MNT/test umount $MNT #Testcase_END # Replacement script @@ expression E; @@ - EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) + ext4_has_metadata_csum(E) https://bugzilla.kernel.org/show_bug.cgi?id=82201 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-11ext4: delete useless comments about ext4_move_extentsXiaoguang Wang
In patch 'ext4: refactor ext4_move_extents code base', Dmitry Monakhov has refactored ext4_move_extents' implementation, but forgot to update the corresponding comments, this patch will try to delete some useless comments. Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-11ext4: fix reservation overflow in ext4_da_write_beginEric Sandeen
Delalloc write journal reservations only reserve 1 credit, to update the inode if necessary. However, it may happen once in a filesystem's lifetime that a file will cross the 2G threshold, and require the LARGE_FILE feature to be set in the superblock as well, if it was not set already. This overruns the transaction reservation, and can be demonstrated simply on any ext4 filesystem without the LARGE_FILE feature already set: dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \ conv=notrunc of=testfile sync dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \ conv=notrunc of=testfile leads to: EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28 EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28 EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28 Adjust the number of credits based on whether the flag is already set, and whether the current write may extend past the LARGE_FILE limit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org