summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2013-03-11ext4: update extent status tree after an extent is zeroed outZheng Liu
When we try to split an extent, this extent could be zeroed out and mark as initialized. But we don't know this in ext4_map_blocks because it only returns a length of allocated extent. Meanwhile we will mark this extent as uninitialized because we only check m_flags. This commit update extent status tree when we try to split an unwritten extent. We don't need to worry about the status of this extent because we always mark it as initialized. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-11ext4: fix wrong m_len value after unwritten extent conversionZheng Liu
The ext4_ext_handle_uninitialized_extents() function was assuming the return value of ext4_ext_map_blocks() is equal to map->m_len. This incorrect assumption was harmless until we started use status tree as a extent cache because we need to update status tree according to 'm_len' value. Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent conversion. It shouldn't cause a bug because we update status tree according to checking EXT4_MAP_UNWRITTEN flag. But it should be fixed. After applied this commit, the following error message from self-testing infrastructure disappears. ... kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len 3 in ext4_map_blocks (allocation) ... Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-11ext4: add self-testing infrastructure to do a sanity checkDmitry Monakhov
This commit adds a self-testing infrastructure like extent tree does to do a sanity check for extent status tree. After status tree is as a extent cache, we'd better to make sure that it caches right result. After applied this commit, we will get a lot of messages when we run xfstests as below. ... kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len 3 in ext4_map_blocks (allocation) ... kernel: ES cache assertation failed for inode: 230 es_cached ex [974/2/4781/20] != found ex [974/1/4781/1000] ... kernel: ES insert assertation failed for inode: 635 ex_status [0/45/21388/w] != es_status [44/1/21432/u] ... Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-11ext4: avoid a potential overflow in ext4_es_can_be_merged()Zheng Liu
Check the length of an extent to avoid a potential overflow in ext4_es_can_be_merged(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-04ext4: invalidate extent status tree during extent migrationDmitry Monakhov
mext_replace_branches() will change inode's extents layout so we have to drop corresponding cache. TESTCASE: 301'th xfstest was not yet accepted to official xfstest's branch and can be found here: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04ext4: remove unnecessary wait for extent conversion in ext4_fallocate()Jan Kara
Now that we don't merge uninitialized extents anymore, ext4_fallocate() is free to operate on the inode while there are still some extent conversions pending - it won't disturb them in any way. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-04ext4: add warning to ext4_convert_unwritten_extents_endioDmitry Monakhov
Splitting extents inside endio is a bad thing, but unfortunately it is still possible. In fact we are pretty close to the moment when all related issues will be fixed. Let's warn developer if it still the case. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04ext4: disable merging of uninitialized extentsDmitry Monakhov
Derived from Jan's patch:http://permalink.gmane.org/gmane.comp.file-systems.ext4/36470 Merging of uninitialized extents creates all sorts of interesting race possibilities when writeback / DIO races with fallocate. Thus ext4_convert_unwritten_extents_endio() has to deal with a case where extent to be converted needs to be split out first. That isn't nice for two reasons: 1) It may need allocation of extent tree block so ENOSPC is possible. 2) It complicates end_io handling code So we disable merging of uninitialized extents which allows us to simplify the code. Extents will get merged after they are converted to initialized ones. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04ext4: ext4_split_extent should take care of extent zerooutDmitry Monakhov
When ext4_split_extent_at() ends up doing zeroout & conversion to initialized instead of split & conversion, ext4_split_extent() gets confused and can wrongly mark the extent back as uninitialized resulting in end IO code getting confused from large unwritten extents and may result in data loss. The example of problematic behavior is: lblk len lblk len ext4_split_extent() (ex=[1000,30,uninit], map=[1010,10]) ext4_split_extent_at() (split [1000,30,uninit] at 1020) ext4_ext_insert_extent() -> ENOSPC ext4_ext_zeroout() -> extent [1000,30] is now initialized ext4_split_extent_at() (split [1000,30,init] at 1010, MARK_UNINIT1 | MARK_UNINIT2) -> extent is split and parts marked as uninitialized Fix the problem by rechecking extent type after the first ext4_split_extent_at() returns. None of split_flags can not be applied to initialized extent so this patch also add BUG_ON to prevent similar issues in future. TESTCASE: https://github.com/dmonakhov/xfstests/commit/b8a55eb5ce28c6ff29e620ab090902fcd5833597 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-02ext4: enable quotas before orphan cleanupJan Kara
When using quota feature we need to enable quotas before orphan cleanup so that changes happening during it are properly reflected in quota accounting. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: don't allow quota mount options when quota feature enabledJan Kara
So far we silently ignored when quota mount options were set while quota feature was enabled. But this can create confusion in userspace when mount options are set but silently ignored and also creates opportunities for bugs when we don't properly test all quota types. Actually ext4_mark_dquot_dirty() forgets to test for quota feature so it was dependent on journaled quota options being set. OTOH ext4_orphan_cleanup() tries to enable journaled quota when quota options are specified which is wrong when quota feature is enabled. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: fix a warning from sparse check for ext4_dir_llseekZheng Liu
ext4_dir_llseek is only used as a callback function, and no one calls it directly. So make it as a static function in order to remove a warning message from sparse check. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: convert number of blocks to clusters properlyLukas Czerner
We're using macro EXT4_B2C() to convert number of blocks to number of clusters for bigalloc file systems. However, we should be using EXT4_NUM_B2C(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-03-02ext4: fix possible memory leak in ext4_remount()Wei Yongjun
'orig_data' is malloced in ext4_remount() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
2013-03-02jbd2: fix ERR_PTR dereference in jbd2__journal_startDmitry Monakhov
If start_this_handle() failed handle will be initialized to ERR_PTR() and can not be dereferenced. paging request at fffffffffffffff6 IP: [<ffffffff813c073f>] jbd2__journal_start+0x18f/0x290 PGD 200e067 PUD 200f067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod CPU 0 journal commit I/O error Pid: 2694, comm: fio Not tainted 3.8.0-rc3+ #79 /DQ67SW RIP: 0010:[<ffffffff813c073f>] [<ffffffff813c073f>] jbd2__journal_start+0x18f/0x290 RSP: 0018:ffff880233b8ba58 EFLAGS: 00010292 RAX: 00000000ffffffe2 RBX: ffffffffffffffe2 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82128f48 RBP: ffff880233b8ba98 R08: 0000000000000000 R09: ffff88021440a6e0 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: use percpu counter for extent cache countTheodore Ts'o
Use a percpu counter rather than atomic types for shrinker accounting. There's no need for ultimate accuracy in the shrinker, so this should come a little more cheaply. The percpu struct is somewhat large, but there was a big gap before the cache-aligned s_es_lru_lock anyway, and it fits nicely in there. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-01ext4: optimize ext4_es_shrink()Theodore Ts'o
When the system is under memory pressure, ext4_es_srhink() will get called very often. So optimize returning the number of items in the file system's extent status cache by keeping a per-filesystem count, instead of calculating it each time by scanning all of the inodes in the extent status cache. Also rename the slab used for the extent status cache to be "ext4_extent_status" so it's obviousl the slab in question is created by ext4. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2013-02-27ext4: fix extent status tree regression for file systems > 512GBTheodore Ts'o
This fixes a regression introduced by commit f7fec032aa782. The problem was that the extents status flags caused us to mask out block numbers smaller than 2**28 blocks. Since we didn't test with file systems smaller than 512GB, we didn't notice this during the development cycle. A typical failure looks like this: EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block 152052301: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0 ... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the inode has block number 688923213. When viewed in hex, block number 152052301 (from the syslog) is 0x910224D, while block number 688923213 is 0x2910224D. Note the missing "0x20000000" in the block number. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Verified-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Dave Jones <davej@redhat.com> Verified-by: Dave Jones <davej@redhat.com> Cc: Zheng Liu <gnehzuil.liu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-22ext4: fix free clusters calculation in bigalloc filesystemLukas Czerner
ext4_has_free_clusters() should tell us whether there is enough free clusters to allocate, however number of free clusters in the file system is converted to blocks using EXT4_C2B() which is not only wrong use of the macro (we should have used EXT4_NUM_B2C) but it's also completely wrong concept since everything else is in cluster units. Moreover when calculating number of root clusters we should be using macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be off by one. However r_blocks_count should always be a multiple of the cluster ratio so doing a plain bit shift should be enough here. We avoid using EXT4_B2C() because it's confusing. As a result of the first problem number of free clusters is much bigger than it should have been and ext4_has_free_clusters() would return 1 even if there is really not enough free clusters available. Fix this by removing the EXT4_C2B() conversion of free clusters and using bit shift when calculating number of root clusters. This bug affects number of xfstests tests covering file system ENOSPC situation handling. With this patch most of the ENOSPC problems with bigalloc file system disappear, especially the errors caused by delayed allocation not having enough space when the actual allocation is finally requested. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-02-22ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()Eryu Guan
len is 0 means no extent needs to be removed, so return immediately. Otherwise it could trigger the following BUG_ON() in ext4_es_remove_extent() end = lblk + len - 1; BUG_ON(end < lblk); This could be reproduced by a simple truncate(1) command by an unprivileged user truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile The same is true for __es_insert_extent(). Patched kernel passed xfstests regression test. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-02-18ext4: fix xattr block allocation/release with bigallocLukas Czerner
Currently when new xattr block is created or released we we would call dquot_free_block() or dquot_alloc_block() respectively, among the else decrementing or incrementing the number of blocks assigned to the inode by one block. This however does not work for bigalloc file system because we always allocate/free the whole cluster so we have to count with that in dquot_free_block() and dquot_alloc_block() as well. Use the clusters-to-blocks conversion EXT4_C2B() when passing number of blocks to the dquot_alloc/free functions to fix the problem. The problem has been revealed by xfstests #117 (and possibly others). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Cc: stable@vger.kernel.org
2013-02-18ext4: reclaim extents from extent status treeZheng Liu
Although extent status is loaded on-demand, we also need to reclaim extent from the tree when we are under a heavy memory pressure because in some cases fragmented extent tree causes status tree costs too much memory. Here we maintain a lru list in super_block. When the extent status of an inode is accessed and changed, this inode will be move to the tail of the list. The inode will be dropped from this list when it is cleared. In the inode, a counter is added to count the number of cached objects in extent status tree. Here only written/unwritten/hole extent is counted because delayed extent doesn't be reclaimed due to fiemap, bigalloc and seek_data/hole need it. The counter will be increased as a new extent is allocated, and it will be decreased as a extent is freed. In this commit we use normal shrinker framework to reclaim memory from the status tree. ext4_es_reclaim_extents_count() traverses the lru list to count the number of reclaimable extents. ext4_es_shrink() tries to reclaim written/unwritten/hole extents from extent status tree. The inode that has been shrunk is moved to the tail of lru list. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: adjust some functions for reclaiming extents from extent status treeZheng Liu
This commit changes some interfaces in extent status tree because we need to use inode to count the cached objects in a extent status tree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: remove single extent cacheZheng Liu
Single extent cache could be removed because we have extent status tree as a extent cache, and it would be better. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: lookup block mapping in extent status treeZheng Liu
After tracking all extent status, we already have a extent cache in memory. Every time we want to lookup a block mapping, we can first try to lookup it in extent status tree to avoid a potential disk I/O. A new function called ext4_es_lookup_extent is defined to finish this work. When we try to lookup a block mapping, we always call ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we first try to lookup a block mapping in extent status tree. A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks in order not to put a hole into extent status tree because this hole will be converted to delayed extent in the tree immediately. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: track all extent status in extent status treeZheng Liu
By recording the phycisal block and status, extent status tree is able to track the status of every extents. When we call _map_blocks functions to lookup an extent or create a new written/unwritten/delayed extent, this extent will be inserted into extent status tree. We don't load all extents from disk in alloc_inode() because it costs too much memory, and if a file is opened and closed frequently it will takes too much time to load all extent information. So currently when we create/lookup an extent, this extent will be inserted into extent status tree. Hence, the extent status tree may not comprehensively contain all of the extents found in the file. Here a condition we need to take care is that an extent might contains unwritten and delayed status simultaneously because an extent is delayed allocated and could be allocated by fallocate. At this time we need to keep delayed status because later we need to update delayed reservation space using it. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flagZheng Liu
This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag because in later commit ext4_map_blocks needs to use this flag to determine the extent status. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18ext4: rename and improbe ext4_es_find_extent()Zheng Liu
This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent and improve this function. First, we split input and output parameter. Second, this function never return the first block of the next delayed extent after 'es'. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: add physical block and status member into extent status treeZheng Liu
This commit adds two members in extent_status structure to let it record physical block and extent status. Here es_pblk is used to record both of them because physical block only has 48 bits. So extent status could be stashed into it so that we can save some memory. Now written, unwritten, delayed and hole are defined as status. Due to new member is added into extent status tree, all interfaces need to be adjusted. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18ext4: refine extent status treeZheng Liu
This commit refines the extent status tree code. 1) A prefix 'es_' is added to to the extent status tree structure members. 2) Refactored es_remove_extent() so that __es_remove_extent() can be used by es_insert_extent() to remove the old extent entry(-ies) before inserting a new one. 3) Rename extent_status_end() to ext4_es_end() 4) ext4_es_can_be_merged() is define to check whether two extents can be merged or not. 5) Update and clarified comments. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-15ext4: use ERR_PTR() abstraction for ext4_append()Theodore Ts'o
Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate pointer to an integer for the error code, as a code cleanup. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-15ext4: refactor code to read directory blocks into ext4_read_dirblock()Theodore Ts'o
The code to read in directory blocks and verify their metadata checksums was replicated in ten different places across fs/ext4/namei.c, and the code was buggy in subtle ways in a number of those replicated sites. In some cases, ext4_error() was called with a training newline. In others, in particularly in empty_dir(), it was possible to call ext4_dirent_csum_verify() on an index block, which would trigger false warnings requesting the system adminsitrator to run e2fsck. By refactoring the code, we make the code more readable, as well as shrinking the compiled object file by over 700 bytes and 50 lines of code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14ext4: add debugging context for warning in ext4_da_update_reserve_space()Theodore Ts'o
Print some additional debugging context to hopefully help to debug a warning which is getting triggered by xfstests #74. Also remove extraneous newlines from when printk's were converted to ext4_warning() and ext4_msg(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14ext4: use KERN_WARNING for warning messagesTheodore Ts'o
Some messages printed related to a WARN_ON(1) were printed using KERN_NOTICE. Use KERN_WARNING or ext4_warning() instead so that context related to the WARN_ON() is printed at the same printk warning level (and log files, etc.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09jbd2: use module parameters instead of debugfs for jbd_debugTheodore Ts'o
There are multiple reasons to move away from debugfs. First of all, we are only using it for a single parameter, and it is much more complicated to set up (some 30 lines of code compared to 3), and one more thing that might fail while loading the jbd2 module. Secondly, as a module paramter it can be specified as a boot option if jbd2 is built into the kernel, or as a parameter when the module is loaded, and it can also be manipulated dynamically under /sys/module/jbd2/parameters/jbd2_debug. So it is more flexible. Ultimately we want to move away from using jbd_debug() towards tracepoints, but for now this is still a useful simplification of the code base. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09ext4: use module parameters instead of debugfs for mballoc_debugTheodore Ts'o
There are multiple reasons to move away from debugfs. First of all, we are only using it for a single parameter, and it is much more complicated to set up (some 30 lines of code compared to 3), and one more thing that might fail while loading the ext4 module. Secondly, as a module paramter it can be specified as a boot option if ext4 is built into the kernel, or as a parameter when the module is loaded, and it can also be manipulated dynamically under /sys/module/ext4/parameters/mballoc_debug. So it is more flexible. Ultimately we want to move away from using mb_debug() towards tracepoints, but for now this is still a useful simplification of the code base. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09ext4: start handle at the last possible moment when creating inodesTheodore Ts'o
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle until the inode has been succesfully allocated. In order to do this, we need to start the handle in the ext4_new_inode(). So create a new variant of this function, ext4_new_inode_start_handle(), so the handle can be created at the last possible minute, before we need to modify the inode allocation bitmap block. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09ext4: fix the number of credits needed for acl ops with inline dataTheodore Ts'o
Operations which modify extended attributes may need extra journal credits if inline data is used, since there is a chance that some extended attributes may need to get pushed to an external attribute block. Changes to reflect this was made in xattr.c, but they were missed in fs/ext4/acl.c. To fix this, abstract the calculation of the number of credits needed for xattr operations to an inline function defined in ext4_jbd2.h, and use it in acl.c and xattr.c. Also move the function declarations used in inline.c from xattr.h (where they are non-obviously hidden, and caused problems since ext4_jbd2.h needs to use the function ext4_has_inline_data), and move them to ext4.h. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()Theodore Ts'o
The ext4_unlink() and ext4_rmdir() don't actually release the blocks associated with the file/directory. This gets done in a separate jbd2 handle called via ext4_evict_inode(). Thus, we don't need to reserve lots of journal credits for the truncate. Note that using too many journal credits is non-optimal because it can leading to the journal transmit getting closed too early, before it is strictly necessary. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: fix the number of credits needed for ext4_ext_migrate()Theodore Ts'o
The migration ioctl creates a temporary inode. Since this inode is never linked to a directory, we don't need to reserve journal credits required for modifying the directory. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: start handle at the last possible moment in ext4_rmdir()Theodore Ts'o
Don't start the jbd2 transaction handle until after the directory entry has been found, to minimize the amount of time that a handle is held active. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: start handle at the last possible moment in ext4_unlink()Theodore Ts'o
Don't start the jbd2 transaction handle until after the directory entry has been found, to minimize the amount of time that a handle is held active. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: grab page before starting transaction handle in write_begin()Theodore Ts'o
The grab_cache_page_write_begin() function can potentially sleep for a long time, since it may need to do memory allocation which can block if the system is under significant memory pressure, and because it may be blocked on page writeback. If it does take a long time to grab the page, it's better that we not hold an active jbd2 handle. So grab a handle on the page first, and _then_ start the transaction handle. This commit fixes the following long transaction handle hold time: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09ext4: pass context information to jbd2__journal_start()Theodore Ts'o
So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08ext4: move the jbd2 wrapper functions out of super.cTheodore Ts'o
Move the jbd2 wrapper functions which start and stop handles out of super.c, where they don't really logically belong, and into ext4_jbd2.c. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08jbd2: add tracepoints which provide per-handle statistics Theodore Ts'o
Handles which stay open a long time are problematic when it comes time to close down a transaction so it can be committed. These tracepoints will help us determine which ones are the problematic ones, and to validate whether changes makes things better or worse. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-07jbd2: revert "jbd2: add COW fields to struct jbd2_journal_handle"Theodore Ts'o
This reverts commit 93737456d68ddcb86232f669b83da673dd12e351. The cow-snapshots effort is no longer active, so remove these extra fields to shrink down the handle structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-07jbd2: track request delay statisticsTheodore Ts'o
Track the delay between when we first request that the commit begin and when it actually begins, so we can see how much of a gap exists. In theory, this should just be the remaining scheduling quantuum of the thread which requested the commit (assuming it was not a synchronous operation which triggered the commit request) plus scheduling overhead; however, it's possible that real time processes might get in the way of letting the kjournald thread from executing. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-04ext4: optimize mballoc for large allocationsTheodore Ts'o
The ext4 block allocator only maintains buddy bitmaps for chunks which are less than or equal to one quarter of a block group. That is, for a file aystem with a 1k blocksize, and where the number of blocks in a block group is 8192 blocks, the largest chunk size tracked by buddy bitmaps is 2048 blocks. For a file system with a 4k blocksize, and where the number of blocks in a block group is 32768 blocks, the largest chunk size tracked by buddy bitmaps is 8192 blocks. To work around this code, mballoc.c before this commit would truncate allocation requests to the number of blocks in a block group minus 10. Why 10? Aside from being a completely arbitrary number, it avoids block allocation to be a power of two larger than 25% of the block group. If you try to explicitly fallocate 50% of the block group size, this will demonstrate the problem; the block allocation code will scan the all of the blocks in the file system with cr==0 (since the request is for a natural power of two), but then completely fail for all blocks groups, since the buddy bitmaps don't track chunk sizes of 50% of the block group. To fix this, in these we use ext4_mb_complex_scan_group() instead of ext4_mb_simple_scan_group(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca>
2013-02-03ext4: check incompatible mount options while mounting ext2/3Theodore Ts'o
Check for incompatible mount options when using the ext4 file system driver to mount ext2 or ext3 file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>