summaryrefslogtreecommitdiff
path: root/fs/ext4/ext4.h
AgeCommit message (Collapse)Author
2012-05-24Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-15userns: Convert ext4 to user kuid/kgid where appropriateEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-16ext4: address scalability issue by removing extent cache statisticsTheodore Ts'o
Andi Kleen and Tim Chen have reported that under certain circumstances the extent cache statistics are causing scalability problems due to cache line bounces. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-03-29Merge branch 'for-3.4' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd changes from Bruce Fields: Highlights: - Benny Halevy and Tigran Mkrtchyan implemented some more 4.1 features, moving us closer to a complete 4.1 implementation. - Bernd Schubert fixed a long-standing problem with readdir cookies on ext2/3/4. - Jeff Layton performed a long-overdue overhaul of the server reboot recovery code which will allow us to deprecate the current code (a rather unusual user of the vfs), and give us some needed flexibility for further improvements. - Like the client, we now support numeric uid's and gid's in the auth_sys case, allowing easier upgrades from NFSv2/v3 to v4.x. Plus miscellaneous bugfixes and cleanup. Thanks to everyone! There are also some delegation fixes waiting on vfs review that I suppose will have to wait for 3.5. With that done I think we'll finally turn off the "EXPERIMENTAL" dependency for v4 (though that's mostly symbolic as it's been on by default in distro's for a while). And the list of 4.1 todo's should be achievable for 3.5 as well: http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues though we may still want a bit more experience with it before turning it on by default. * 'for-3.4' of git://linux-nfs.org/~bfields/linux: (55 commits) nfsd: only register cld pipe notifier when CONFIG_NFSD_V4 is enabled nfsd4: use auth_unix unconditionally on backchannel nfsd: fix NULL pointer dereference in cld_pipe_downcall nfsd4: memory corruption in numeric_name_to_id() sunrpc: skip portmap calls on sessions backchannel nfsd4: allow numeric idmapping nfsd: don't allow legacy client tracker init for anything but init_net nfsd: add notifier to handle mount/unmount of rpc_pipefs sb nfsd: add the infrastructure to handle the cld upcall nfsd: add a header describing upcall to nfsdcld nfsd: add a per-net-namespace struct for nfsd sunrpc: create nfsd dir in rpc_pipefs nfsd: add nfsd4_client_tracking_ops struct and a way to set it nfsd: convert nfs4_client->cl_cb_flags to a generic flags field NFSD: Fix nfs4_verifier memory alignment NFSD: Fix warnings when NFSD_DEBUG is not defined nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes) nfsd: rename 'int access' to 'int may_flags' in nfsd_open() ext4: return 32/64-bit dir name hash according to usage type fs: add new FMODE flags: FMODE_32bithash and FMODE_64bithash ...
2012-03-20ext4: add no_printk argument validation, fix falloutJoe Perches
Add argument validation to debug functions. Use ##__VA_ARGS__. Fix format and argument mismatches. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19ext4: return 32/64-bit dir name hash according to usage typeFan Yong
Traditionally ext2/3/4 has returned a 32-bit hash value from llseek() to appease NFSv2, which can only handle a 32-bit cookie for seekdir() and telldir(). However, this causes problems if there are 32-bit hash collisions, since the NFSv2 server can get stuck resending the same entries from the directory repeatedly. Allow ext4 to return a full 64-bit hash (both major and minor) for telldir to decrease the chance of hash collisions. This still needs integration on the NFS side. Patch-updated-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> (blame me if something is not correct) Signed-off-by: Fan Yong <yong.fan@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-05ext4: add comments to definition of ext4_io_end_tCurt Wohlgemuth
This should make it more clear what this structure is used for, and how some of the (mutually exclusive) fields are used to keep page cache references. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-05ext4: fix race between sync and completed io workJeff Moyer
The following command line will leave the aio-stress process unkillable on an ext4 file system (in my case, mounted on /mnt/test): aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2 This is using the aio-stress program from the xfstests test suite. That particular command line tells aio-stress to do random writes to 20 files from 20 threads (one thread per file). The files are NOT preallocated, so you will get writes to random offsets within the file, thus creating holes and extending i_size. It also opens the file with O_DIRECT and O_SYNC. On to the problem. When an I/O requires unwritten extent conversion, it is queued onto the completed_io_list for the ext4 inode. Two code paths will pull work items from this list. The first is the ext4_end_io_work routine, and the second is ext4_flush_completed_IO, which is called via the fsync path (and O_SYNC handling, as well). There are two issues I've found in these code paths. First, if the fsync path beats the work routine to a particular I/O, the work routine will free the io_end structure! It does not take into account the fact that the io_end may still be in use by the fsync path. I've fixed this issue by adding yet another IO_END flag, indicating that the io_end is being processed by the fsync path. The second problem is that the work routine will make an assignment to io->flag outside of the lock. I have witnessed this result in a hang at umount. Moving the flag setting inside the lock resolved that problem. The problem was introduced by commit b82e384c7b ("ext4: optimize locking for end_io extent conversion"), which first appeared in 3.2. As such, the fix should be backported to that release (probably along with the unwritten extent conversion race fix). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: stable@kernel.org
2012-03-05ext4: make ext4_show_options() be table-drivenTheodore Ts'o
Consistently show mount options which are the non-default, so that /proc/mounts accurately shows the mount options that would be necessary to mount the file system in its current mode of operation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03ext4: simplify handling of the errors=* mount optionsTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02ext4: remove the I_VERSION mount flag and use the super_block flag insteadTheodore Ts'o
There's no point to have two bits that are set in parallel; so use the MS_I_VERSION flag that is needed by the VFS anyway, and that way we free up a bit in sbi->s_mount_opts. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20ext4: fix race between unwritten extent conversion and truncateJeff Moyer
The following comment in ext4_end_io_dio caught my attention: /* XXX: probably should move into the real I/O completion handler */ inode_dio_done(inode); The truncate code takes i_mutex, then calls inode_dio_wait. Because the ext4 code path above will end up dropping the mutex before it is reacquired by the worker thread that does the extent conversion, it seems to me that the truncate can happen out of order. Jan Kara mentioned that this might result in error messages in the system logs, but that should be the extent of the "damage." The fix is pretty straight-forward: don't call inode_dio_done until the extent conversion is complete. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-02-20ext4: fix INCOMPAT feature codepoint reservation for INLINEDATATheodore Ts'o
In commit 9b90e5e028 I incorrectly reserved the wrong bit for EXT4_FEATURE_INCOMPAT_INLINEDATA per the discussion on the linux-ext4 list on December 7, 2011. The codepoint 0x2000 should be used for EXT4_FEATURE_INCOMPAT_USE_META_CSUM, so INLINEDATA will be assigned the value 0x8000. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20ext4: fix race when setting bitmap_uptodate flagTheodore Ts'o
In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate() before submitting the buffer for read. The is bad, since we check bitmap_uptodate() without locking the buffer, and so if another process is racing with us, it's possible that they will think the bitmap is uptodate even though the read has not completed yet, resulting in inodes and blocks potentially getting allocated more than once if we get really unlucky. Addresses-Google-Bug: 2828254 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-10Merge branch 'for_linus' into for_linus_mergedTheodore Ts'o
Conflicts: fs/ext4/ioctl.c
2012-01-05ext4: make more symbols staticEric Sandeen
A couple more functions can reasonably be made static if desired. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-05ext4: reserve new feature flag codepointsTheodore Ts'o
Reserve the ext4 features flags EXT4_FEATURE_RO_COMPAT_METADATA_CSUM, EXT4_FEATURE_INCOMPAT_INLINEDATA, and EXT4_FEATURE_INCOMPAT_LARGEDIR. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04ext4: add new online resize interfaceYongqiang Yang
This patch adds new online resize interface, whose input argument is a 64-bit integer indicating how many blocks there are in the resized fs. In new resize impelmentation, all work like allocating group tables are done by kernel side, so the new resize interface can support flex_bg feature and prepares ground for suppoting resize with features like bigalloc and exclude bitmap. Besides these, user-space tools just passes in the new number of blocks. We delay initializing the bitmaps and inode tables of added groups if possible and add multi groups (a flex groups) each time, so new resize is very fast like mkfs. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04ext4: add a function which sets up group blocks of a flex bgYongqiang Yang
This patch adds a function named setup_new_flex_group_blocks() which sets up group blocks of a flex bg. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04ext4: propagate umode_tAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-29ext4: use proper little-endian bitopsAkinobu Mita
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for ext4. Only two ext4_{set,clear}_bit() calls check the return value. The rest of calls ignore the return value and they can be replaced with __{set,clear}_bit_le(). This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le() to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit() for the two places where old bit needs to be returned. This ext4_{set,clear}_bit() change is considered safe, because if someone uses these macros without noticing the change, new ext4_{set,clear}_bit don't have return value and causes compiler errors where the return value is used. This also removes unused ext4_find_first_zero_bit(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-29ext4: remove no longer used functions in inode.cZheng Liu
The functions ext4_block_truncate_page() and ext4_block_zero_page_range() are no longer used, so remove them. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-02Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits) jbd2: Unify log messages in jbd2 code jbd/jbd2: validate sb->s_first in journal_get_superblock() ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled ext4: fix a typo in struct ext4_allocation_context ext4: Don't normalize an falloc request if it can fit in 1 extent. ext4: remove comments about extent mount option in ext4_new_inode() ext4: let ext4_discard_partial_buffers handle unaligned range correctly ext4: return ENOMEM if find_or_create_pages fails ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock() ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten ext4: optimize locking for end_io extent conversion ext4: remove unnecessary call to waitqueue_active() ext4: Use correct locking for ext4_end_io_nolock() ext4: fix race in xattr block allocation path ext4: trace punch_hole correctly in ext4_ext_map_blocks ext4: clean up AGGRESSIVE_TEST code ext4: move variables to their scope ext4: fix quota accounting during migration ext4: migrate cleanup ...
2011-11-01treewide: use __printf not __attribute__((format(printf,...)))Joe Perches
Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=*.[ch] -w "__attribute__" * | \ grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwrittenTao Ma
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clear the flag and decrease the counter in the same time. We have found some bugs that the flag is set while leaving i_aiodio_unwritten unchanged(commit 32c80b32c053d). So this patch just tries to create a helper function to wrap them to avoid any future bug. The idea is inspired by Eric. Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-29ext4: fix quota accounting during migrationDmitry Monakhov
The tmp_inode should have same uid/gid as the original inode. Otherwise new metadata blocks will be accounted to wrong quota-id, which will result in a quota leak after the inode migration is completed. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25ext4: update EOFBLOCKS flag on fallocate properlyDmitry Monakhov
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE Currently it happens only if new extent was allocated. TESTCASE: fallocate test_file -n -l4096 fallocate test_file -l4096 Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And fsck will complain about that. Also remove ping pong in ext4_fallocate() in case of new extents, where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later ext4_falloc_update_inode() restore it again. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08ext4: remove the obsolete/broken EXT4_IOC_WAIT_FOR_READONLY ioctlTao Ma
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is also broken. No one sets the set_ro_timer, no one wakes up us and our state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08ext4: remove deprecated oldallocLukas Czerner
For a long time now orlov is the default block allocator in the ext4. It performs better than the old one and no one seems to claim otherwise so we can safely drop it and make oldalloc and orlov mount option deprecated. This is a part of the effort to reduce number of ext4 options hence the test matrix. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: attempt to fix race in bigalloc code pathAditya Kali
Currently, there exists a race between delayed allocated writes and the writeback when bigalloc feature is in use. The race was because we wanted to determine what blocks in a cluster are under delayed allocation and we were using buffer_delayed(bh) check for it. But, the writeback codepath clears this bit without any synchronization which resulted in a race and an ext4 warning similar to: EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0 reserved data blocks The race existed in two places. (1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from writeback code path. (2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where buffer_delayed(bh) is set. To fix (1), this patch introduces a new buffer_head state bit - BH_Da_Mapped. This bit is set under the protection of EXT4_I(inode)->i_data_sem when we have actually mapped the delayed allocated blocks during the writeout time. We can now reliably check for this bit inside ext4_find_delalloc_range() to determine whether the reservation for the blocks have already been claimed or not. To fix (2), it was necessary to set buffer_delay(bh) under the protection of i_data_sem. So, I extracted the very beginning of ext4_map_blocks into a new function - ext4_da_map_blocks() - and performed the required setting of bh_delay bit and the quota reservation under the protection of i_data_sem. These two fixes makes the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent, thus removing the race. Tested: I was able to reproduce the problem by running 'dd' and 'fsync' in parallel. Also, xfstests sometimes used to reproduce this race. After the fix both my test and xfstests were successful and no race (warning message) was observed. Google-Bug-Id: 4997027 Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: add some tracepoints in ext4/extents.cAditya Kali
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in ext4/inode.c. Tested: Built and ran the kernel and verified that these tracepoints work. Also ran xfstests. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters()Theodore Ts'o
Rename the function so it is more clear what is going on. Also rename the various variables so it's clearer what's happening. Also fix a missing blocks to cluster conversion when reading the number of reserved blocks for root. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()Theodore Ts'o
This function really claims a number of free clusters, not blocks, so rename it so it's clearer what's going on. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()Theodore Ts'o
This function really returns the number of clusters after initializing an uninitalized block bitmap has been initialized. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()Theodore Ts'o
This function really counts the free clusters reported in the block group descriptors, so rename it to reduce confusion. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: Rename ext4_free_blks_{count,set}() to refer to clustersTheodore Ts'o
The field bg_free_blocks_count_{lo,high} in the block group descriptor has been repurposed to hold the number of free clusters for bigalloc functions. So rename the functions so it makes it easier to read and audit the block allocation and block freeing code. Note: at this point in bigalloc development we doesn't support online resize, so this also makes it really obvious all of the places we need to fix up to add support for online resize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: enable mounting bigalloc as read/writeTheodore Ts'o
Now that we have implemented all of the changes needed for bigalloc, we can finally enable it! Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: Fix bigalloc quota accounting and i_blocks valueAditya Kali
With bigalloc changes, the i_blocks value was not correctly set (it was still set to number of blocks being used, but in case of bigalloc, we want i_blocks to represent the number of clusters being used). Since the quota subsystem sets the i_blocks value, this patch fixes the quota accounting and makes sure that the i_blocks value is set correctly. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: convert the free_blocks field in s_flex_groups to be free_clustersTheodore Ts'o
Convert the free_blocks to be free_clusters to make the final revised bigalloc changes easier to read/understand. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counterTheodore Ts'o
Convert the percpu counters s_dirtyblocks_counter and s_freeblocks_counter in struct ext4_super_info to be s_dirtyclusters_counter and s_freeclusters_counter. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: teach ext4_free_blocks() about bigalloc and clustersTheodore Ts'o
The ext4_free_blocks() function now has two new flags that indicate whether a partial cluster at the beginning or the end of the block extents should be freed or not. That will be up the caller (i.e., truncate), who can figure out whether partial clusters at the beginning or the end of a block range can be freed. We also have to update the ext4_mb_free_metadata() and release_blocks_on_commit() machinery to be cluster-based, since it is used by ext4_free_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: bigalloc changes to block bitmap initialization functionsTheodore Ts'o
Add bigalloc support to ext4_init_block_bitmap() and ext4_free_blocks_after_init(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: split out ext4_free_blocks_after_init()Theodore Ts'o
The function ext4_free_blocks_after_init() used to be a #define of ext4_init_block_bitmap(). This actually made it difficult to understand how the function worked, and made it hard make changes to support clusters. So as an initial cleanup, I've separated out the functionality of initializing block bitmap from calculating the number of free blocks in the new block group. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09ext4: read-only support for bigalloc file systemsTheodore Ts'o
This adds supports for bigalloc file systems. It teaches the mount code just enough about bigalloc superblock fields that it will mount the file system without freaking out that the number of blocks per group is too big. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03ext4: improve handling of conflicting mount optionsTheodore Ts'o
If the user explicitly specifies conflicting mount options for delalloc or dioread_nolock and data=journal, fail the mount, instead of printing a warning and continuing (since many user's won't look at dmesg and notice the warning). Also, print a single warning that data=journal implies that delayed allocation is not on by default (since it's not supported), and furthermore that O_DIRECT is not supported. Improve the text in Documentation/filesystems/ext4.txt so this is clear there as well. Similarly, if the dioread_nolock mount option is specified when the file system block size != PAGE_SIZE, fail the mount instead of printing a warning message and ignoring the mount option. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03ext4: Add new ext4_discard_partial_page_buffers routinesAllison Henderson
This patch adds two new routines: ext4_discard_partial_page_buffers and ext4_discard_partial_page_buffers_no_lock. The ext4_discard_partial_page_buffers routine is a wrapper function to ext4_discard_partial_page_buffers_no_lock. The wrapper function locks the page and passes it to ext4_discard_partial_page_buffers_no_lock. Calling functions that already have the page locked can call ext4_discard_partial_page_buffers_no_lock directly. The ext4_discard_partial_page_buffers_no_lock function zeros a specified range in a page, and unmaps the corresponding buffer heads. Only block aligned regions of the page will have their buffer heads unmapped. Unblock aligned regions will be mapped if needed so that they can be updated with the partial zero out. This function is meant to be used to update a page and its buffer heads to be zeroed and unmapped when the corresponding blocks have been released or will be released. This routine is used in the following scenarios: * A hole is punched and the non page aligned regions of the head and tail of the hole need to be discarded * The file is truncated and the partial page beyond EOF needs to be discarded * The end of a hole is in the same page as EOF. After the page is flushed, the partial page beyond EOF needs to be discarded. * A write operation begins or ends inside a hole and the partial page appearing before or after the write needs to be discarded * A write operation extends EOF and the partial page beyond EOF needs to be discarded This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED which is used when a write operation begins or ends in a hole. When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only buffer heads that are already unmapped will have the corresponding regions of the page zeroed. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-31ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodesTheodore Ts'o
This doesn't make much sense, and it exposes a bug in the kernel where attempts to create a new file in an append-only directory using O_CREAT will fail (but still leave a zero-length file). This was discovered when xfstests #79 was generalized so it could run on all file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc:stable@kernel.org
2011-08-31ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complainingJiaying Zhang
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810 in ext4_evict_inode() causes lockdep complaining about potential deadlock in several places. In most/all of these LOCKDEP complaints it looks like it's a false positive, since many of the potential circular locking cases can't take place by the time the ext4_evict_inode() is called; but since at the very least it may mask real problems, we need to address this. This change removes the flush_completed_IO() and i_mutex lock in ext4_evict_inode(). Instead, we take a different approach to resolve the software lockup that commit 2581fdc810 intends to fix. Rather than having ext4-dio-unwritten thread wait for grabing the i_mutex lock of an inode, we use mutex_trylock() instead, and simply requeue the work item if we fail to grab the inode's i_mutex lock. This should speed up work queue processing in general and also prevents the following deadlock scenario: During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still hold before the page fault happened, we enter a deadlock. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: prevent memory leaks from ext4_mb_init_backend() on error path ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode # ext4: use ext4_msg() instead of printk in mballoc ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree() ext4: use the correct error exit path in ext4_init_inode_table() ext4: add missing kfree() on error return path in add_new_gdb() ext4: change umode_t in tracepoint headers to be an explicit __u16 ext4: fix races in ext4_sync_parent() ext4: Fix overflow caused by missing cast in ext4_fallocate() ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole ext4: simplify parameters of reserve_backup_gdb() ext4: simplify parameters of add_new_gdb() ext4: remove lock_buffer in bclean() and setup_new_group_blocks() ext4: simplify journal handling in setup_new_group_blocks() ext4: let setup_new_group_blocks() set multiple bits at a time ext4: fix a typo in ext4_group_extend() ext4: let ext4_group_add_blocks() handle 0 blocks quickly ext4: let ext4_group_add_blocks() return an error code ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks() ... Fix up conflict in fs/ext4/inode.c: commit aacfc19c626e ("fs: simplify the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO() function for the new simplified calling convention, while commit dae1e52cb126 ("ext4: move ext4_ind_* functions from inode.c to indirect.c") moved the function to another file.
2011-08-01ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()Theodore Ts'o
Introduce new helper functions which try kmalloc, and then fall back to vmalloc if necessary, and use them for allocating and deallocating s_flex_groups. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>