summaryrefslogtreecommitdiff
path: root/fs/ext3
AgeCommit message (Collapse)Author
2014-04-08Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3 improvements, cleanups, reiserfs fix from Jan Kara: "various cleanups for ext2, ext3, udf, isofs, a documentation update for quota, and a fix of a race in reiserfs readdir implementation" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: fix race in readdir ext2: acl: remove unneeded include of linux/capability.h ext3: explicitly remove inode from orphan list after failed direct io fs/isofs/inode.c add __init to init_inodecache() ext3: Speedup WB_SYNC_ALL pass fs/quota/Kconfig: Update filesystems ext3: Update outdated comment before ext3_ordered_writepage() ext3: Update PF_MEMALLOC handling in ext3_write_inode() ext2/3: use prandom_u32() instead of get_random_bytes() ext3: remove an unneeded check in ext3_new_blocks() ext3: remove unneeded check in ext3_ordered_writepage() fs: Mark function as static in ext3/xattr_security.c fs: Mark function as static in ext3/dir.c fs: Mark function as static in ext2/xattr_security.c ext3: Add __init macro to init_inodecache ext2: Add __init macro to init_inodecache udf: Add __init macro to init_inodecache fs: udf: parse_options: blocksize check
2014-04-04Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-17ext3: explicitly remove inode from orphan list after failed direct ioEric Sandeen
Otherwise non-empty orphan list will be triggered on umount. This is just an application of commit da1daf by Dmitry Monakhov to the same code in ext3. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-13fs: push sync_filesystem() down to the file system's remount_fs()Theodore Ts'o
Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
2014-03-12ext3: Speedup WB_SYNC_ALL passJan Kara
When doing filesystem wide sync, there's no need to force transaction commit separately for each inode because ext3_sync_fs() takes care of forcing commit at the end. Most of the time this slowness doesn't manifest because previous WB_SYNC_NONE writeback doesn't leave much to write but when there are processes aggressively creating new files and several filesystems to sync, the sync slowness can be noticeable. In the following test script sync(1) takes around 6 minutes when there are two ext3 filesystems mounted on a standard SATA drive. After this patch sync is about twice as fast in the default data=ordered mode. For data=writeback mode we have even bigger speedup. function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-04ext3: Update outdated comment before ext3_ordered_writepage()Jan Kara
The comment is heavily outdated. The recursion into the filesystem isn't possible because we use GFP_NOFS for our allocations, the issue about block_write_full_page() dirtying tail page is long resolved as well (that function doesn't dirty buffers at all), and finally we don't start a transaction if all blocks are already allocated and mapped. Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-04ext3: Update PF_MEMALLOC handling in ext3_write_inode()Jan Kara
The special handling of PF_MEMALLOC callers in ext3_write_inode() shouldn't be necessary as there shouldn't be any. Warn about it. Also update comment before the function as it seems somewhat outdated. Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-04ext2/3: use prandom_u32() instead of get_random_bytes()ZhangZhen
Many of the uses of get_random_bytes() do not actually need cryptographically secure random numbers. Replace those uses with a call to prandom_u32(), which is faster and which doesn't consume entropy from the /dev/random driver. The commit dd1f723bf56bd96efc9d90e9e60dc511c79de48f has made that for ext4, and i did the same for ext2/3. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-03ext3: remove an unneeded check in ext3_new_blocks()Dan Carpenter
We know "fatal" is zero here. The code can be simplified a bit by assigning directly. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-03ext3: remove unneeded check in ext3_ordered_writepage()Dan Carpenter
We already know "ret" is zero so there is no need to do: if (!ret) ret = err; We can just assign ret directly instead. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-03fs: Mark function as static in ext3/xattr_security.cRashika Kheria
Mark function as static in ext3/xattr_security.c because it is not used outside this file. This eliminates the following warning in ext3/xattr_security.c: fs/ext3/xattr_security.c:46:5: warning: no previous prototype for ‘ext3_initxattrs’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-03fs: Mark function as static in ext3/dir.cRashika Kheria
Mark function as static in ext3/dir.c because it is not used outside this file. This also eliminates the following warning in ext3/dir.c: fs/ext3/dir.c:278:8: warning: no previous prototype for ‘ext3_dir_llseek’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Jan Kara <jack@suse.cz>
2014-03-03ext3: Add __init macro to init_inodecacheFabian Frederick
init_inodecache is only called by __init init_ext3_fs. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
2014-01-26ext2/3/4: use generic posix ACL infrastructureChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26fs: make posix_acl_create more usefulChristoph Hellwig
Rename the current posix_acl_created to __posix_acl_create and add a fully featured helper to set up the ACLs on file creation that uses get_acl(). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26fs: make posix_acl_chmod more usefulChristoph Hellwig
Rename the current posix_acl_chmod to __posix_acl_chmod and add a fully featured ACL chmod helper that uses the ->set_acl inode operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24fs-ext3-use-rbtree-postorder-iteration-helper-instead-of-opencoding-fixAndrew Morton
use do{}while - more efficient and it squishes a coccinelle warning Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-24fs/ext3: use rbtree postorder iteration helper instead of opencodingCody P Schafer
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead of opencoding an alternate postorder iteration that modifies the tree Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext[23], udf and quota fixes from Jan Kara: "Assorted fixes in quota, ext2, ext3 & udf. Probably the most important is a fix of fs corruption issue in ext2 XIP support (OTOH xip is rarely used)" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Fix fs corruption in ext2_get_xip_mem() quota: info leak in quota_getquota() jbd: Revert "jbd: remove dependency on __GFP_NOFAIL" udf: fix for pathetic mount times in case of invalid file system ext3: Count journal as bsddf overhead in ext3_statfs
2013-10-16ext3: Count journal as bsddf overhead in ext3_statfsEric Sandeen
ext4 counts journal space as bsddf overhead, but ext3 does not. For some reason when I patched ext4 I thought I should leave ext3 alone, but frankly it makes more sense to fix it, I think. Otherwise we get inconsistent behavior from ext3 under ext3.ko, and ext3 under ext4.ko, which is not at all desirable... This is testable by xfstests shared/289, though it will need modification because it currently special-cases ext3. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2013-10-15ext[34]: fix double put in tmpfileMiklos Szeredi
d_tmpfile() already swallowed the inode ref. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-06Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3, reiserfs, udf & isofs fixes from Jan Kara: "The contains a bunch of ext3 cleanups and minor improvements, major reiserfs locking changes which should hopefully fix deadlocks introduced by BKL removal, and udf/isofs changes to refuse mounting fs rw instead of mounting it ro automatically which makes eject button work as expected for all media (see the changelog for why userspace should be ok with this change)" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: jbd: use a single printk for jbd_debug() reiserfs: locking, release lock around quota operations reiserfs: locking, handle nested locks properly reiserfs: locking, push write lock out of xattr code jbd: relocate assert after state lock in journal_commit_transaction() udf: Refuse RW mount of the filesystem instead of making it RO udf: Standardize return values in mount sequence isofs: Refuse RW mount of the filesystem instead of making it RO ext3: allow specifying external journal by pathname mount option jbd: remove unneeded semicolon
2013-08-28ext4: Fix misspellings using 'codespell' toolAnatol Pomozov
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-31ext3: allow specifying external journal by pathname mount optionEric Sandeen
It's always been a hassle that if an external journal's device number changes, the filesystem won't mount. And since boot-time enumeration can change, device number changes aren't unusual. The current mechanism to update the journal location is by passing in a mount option w/ a new devnum, but that's a hassle; it's a manual approach, fixing things after the fact. Adding a mount option, "-o journal_path=/dev/$DEVICE" would help, since then we can do i.e. # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ... and it'll mount even if the devnum has changed, as shown here: # losetup /dev/loop0 journalfile # mke2fs -L mylabel-journal -O journal_dev /dev/loop0 # mkfs.ext3 -L mylabel -J device=/dev/loop0 /dev/sdb1 Change the journal device number: # losetup -d /dev/loop0 # losetup /dev/loop1 journalfile And today it will fail: # mount /dev/sdb1 /mnt/test mount: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so # dmesg | tail -n 1 [17343.240702] EXT3-fs (sdb1): error: couldn't read superblock of external journal But with this new mount option, we can specify the new path: # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test # (which does update the encoded device number, incidentally): # umount /dev/sdb1 # dumpe2fs -h /dev/sdb1 | grep "Journal device" dumpe2fs 1.41.12 (17-May-2010) Journal device: 0x0701 But best of all we can just always mount by journal-path, and it'll always work: # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test # So the journal_path option can be specified in fstab, and as long as the disk is available somewhere, and findable by label (or by UUID), we can mount. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-21ext3: fix a BUG when opening a file with O_TMPFILE flagZheng Liu
When we try to open a file with O_TMPFILE flag, we will trigger a bug. The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and this check always fails because we set ->i_nlink = 1 in inode_init_always(). We can use the following program to trigger it: int main(int argc, char *argv[]) { int fd; fd = open(argv[1], O_TMPFILE, 0666); if (fd < 0) { perror("open "); return -1; } close(fd); return 0; } The oops message looks like this: kernel: kernel BUG at fs/ext3/namei.c:1992! kernel: invalid opcode: 0000 [#1] SMP kernel: Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand ipv6 dm_mirror dm_region_hash dm_log dm_mod parport_pc parport serio_raw sg dcdbas pcspkr i2c_i801 ehci_pci ehci_hcd button acpi_cpufreq mperf e1000e ptp pps_core ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core ext3 jbd sd_mod ahci libahci libata scsi_mod uhci_hcd kernel: CPU: 0 PID: 2882 Comm: tst_tmpfile Not tainted 3.11.0-rc1+ #4 kernel: Hardware name: Dell Inc. OptiPlex 780 /0V4W66, BIOS A05 08/11/2010 kernel: task: ffff880112d30050 ti: ffff8801124d4000 task.ti: ffff8801124d4000 kernel: RIP: 0010:[<ffffffffa00db5ae>] [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3] kernel: RSP: 0018:ffff8801124d5cc8 EFLAGS: 00010202 kernel: RAX: 0000000000000000 RBX: ffff880111510128 RCX: ffff8801114683a0 kernel: RDX: 0000000000000000 RSI: ffff880111510128 RDI: ffff88010fcf65a8 kernel: RBP: ffff8801124d5d18 R08: 0080000000000000 R09: ffffffffa00d3b7f kernel: R10: ffff8801114683a0 R11: ffff8801032a2558 R12: 0000000000000000 kernel: R13: ffff88010fcf6800 R14: ffff8801032a2558 R15: ffff8801115100d8 kernel: FS: 00007f5d172b5700(0000) GS:ffff880117c00000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b kernel: CR2: 00007f5d16df15d0 CR3: 0000000110b1d000 CR4: 00000000000407f0 kernel: Stack: kernel: 000000000000000c ffff8801048a7dc8 ffff8801114685a8 ffffffffa00b80d7 kernel: ffff8801124d5e38 ffff8801032a2558 ffff88010ce24d68 0000000000000000 kernel: ffff88011146b300 ffff8801124d5d44 ffff8801124d5d78 ffffffffa00db7e1 kernel: Call Trace: kernel: [<ffffffffa00b80d7>] ? journal_start+0x8c/0xbd [jbd] kernel: [<ffffffffa00db7e1>] ext3_tmpfile+0xb2/0x13b [ext3] kernel: [<ffffffff821076f8>] path_openat+0x11f/0x5e7 kernel: [<ffffffff821c86b4>] ? list_del+0x11/0x30 kernel: [<ffffffff82065fa2>] ? __dequeue_entity+0x33/0x38 kernel: [<ffffffff82107cd5>] do_filp_open+0x3f/0x8d kernel: [<ffffffff82112532>] ? __alloc_fd+0x50/0x102 kernel: [<ffffffff820f9296>] do_sys_open+0x13b/0x1cd kernel: [<ffffffff820f935c>] SyS_open+0x1e/0x20 kernel: [<ffffffff82398c02>] system_call_fastpath+0x16/0x1b kernel: Code: 39 c7 0f 85 67 01 00 00 0f b7 03 25 00 f0 00 00 3d 00 40 00 00 74 18 3d 00 80 00 00 74 11 3d 00 a0 00 00 74 0a 83 7b 48 00 74 04 <0f> 0b eb fe 49 8b 85 50 03 00 00 4c 89 f6 48 c7 c7 c0 99 0e a0 kernel: RIP [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3] kernel: RSP <ffff8801124d5cc8> Here we couldn't call clear_nlink() directly because in d_tmpfile() we will call inode_dec_link_count() to decrease ->i_nlink. So this commit tries to call d_tmpfile() before ext4_orphan_add() to fix this problem. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk>
2013-07-09Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3 fix and quota cleanup from Jan Kara: "A fix of ext3 error reporting from fsync and a quota cleanup" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Convert use of typedef ctl_table to struct ctl_table ext3: Fix fsync error handling after filesystem abort.
2013-07-04ext3: Fix fsync error handling after filesystem abort.Dmitry Monakhov
If filesystem was aborted we will return success due to (sb->s_flags & MS_RDONLY) which is incorrect and results in data loss. In order to handle fs abort correctly we have to check fs state once we discover that it is in MS_RDONLY state Test case: http://patchwork.ozlabs.org/patch/244297/ Changes from V1: - fix spelling - fix smp_rmb()/debug order Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-03mm: vmscan: take page buffers dirty and locked state into accountMel Gorman
Page reclaim keeps track of dirty and under writeback pages and uses it to determine if wait_iff_congested() should stall or if kswapd should begin writing back pages. This fails to account for buffer pages that can be under writeback but not PageWriteback which is the case for filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages can have all the buffers clean and writepage does no IO so it should not be accounted as congested. This patch adds an address_space operation that filesystems may optionally use to check if a page is really dirty or really under writeback. An implementation is provided for for buffer_heads is added and used for block operations and ext3 in ordered mode. By default the page flags are obeyed. Credit goes to Jan Kara for identifying that the page flags alone are not sufficient for ext3 and sanity checking a number of ideas on how the problem could be addressed. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-07-01ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()Al Viro
Both ext3 and ext4 htree_dirblock_to_tree() is just filling the in-core rbtree for use by call_filldir(). All updates of ->f_pos are done by the latter; bumping it here (on error) is obviously wrong - we might very well have it nowhere near the block we'd found an error in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-06-29ext3 ->tmpfile() supportAl Viro
In this case we do need a bit more than usual, due to orphan list handling. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29[readdir] convert ext3Al Viro
new helper: dir_relax(inode). Call when you are in location that will _not_ be invalidated by directory modifications (block boundary, in case of ext*). Returns whether the directory has survived (dropping i_mutex allows rmdir to kill the sucker; if it returns false to us, ->iterate() is obviously done) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-22jbd: change journal_invalidatepage() to accept lengthLukas Czerner
->invalidatepage() aop now accepts range to invalidate so we can make use of it in journal_invalidatepage() and all the users in ext3 file system. Also update ext3 trace point to print out length argument. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-22mm: change invalidatepage prototype to accept lengthLukas Czerner
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
2013-05-08Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge more incoming from Andrew Morton: - Various fixes which were stalled or which I picked up recently - A large rotorooting of the AIO code. Allegedly to improve performance but I don't really have good performance numbers (I might have lost the email) and I can't raise Kent today. I held this out of 3.9 and we could give it another cycle if it's all too late/scary. I ended up taking only the first two thirds of the AIO rotorooting. I left the percpu parts and the batch completion for later. - Linus * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits) aio: don't include aio.h in sched.h aio: kill ki_retry aio: kill ki_key aio: give shared kioctx fields their own cachelines aio: kill struct aio_ring_info aio: kill batch allocation aio: change reqs_active to include unreaped completions aio: use cancellation list lazily aio: use flush_dcache_page() aio: make aio_read_evt() more efficient, convert to hrtimers wait: add wait_event_hrtimeout() aio: refcounting cleanup aio: make aio_put_req() lockless aio: do fget() after aio_get_req() aio: dprintk() -> pr_debug() aio: move private stuff out of aio.h aio: add kiocb_cancel() aio: kill return value of aio_complete() char: add aio_{read,write} to /dev/{null,zero} aio: remove retry-based AIO ...
2013-05-08aio: don't include aio.h in sched.hKent Overstreet
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07make blkdev_put() return voidAl Viro
same story as with the previous patches - note that return value of blkdev_close() is lost, since there's nowhere the caller (__fput()) could return it to. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3/jbd fixes from Jan Kara: "A couple of ext3/jbd fixes" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: jbd: use kmem_cache_zalloc for allocating journal head jbd: use kmem_cache_zalloc instead of kmem_cache_alloc/memset jbd: don't wait (forever) for stale tid caused by wraparound ext3: fix data=journal fast mount/umount hang
2013-04-29mm: make snapshotting pages for stable writes a per-bio operationDarrick J. Wong
Walking a bio's page mappings has proved problematic, so create a new bio flag to indicate that a bio's data needs to be snapshotted in order to guarantee stable pages during writeback. Next, for the one user (ext3/jbd) of snapshotting, hook all the places where writes can be initiated without PG_writeback set, and set BIO_SNAP_STABLE there. We must also flag journal "metadata" bios for stable writeout, since file data can be written through the journal. Finally, the MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get rid of it. [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration] [akpm@linux-foundation.org: teeny cleanup] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-20ext3: fix data=journal fast mount/umount hangJan Kara
In data=journal mode, if we unmount the file system before a transaction has a chance to complete, when the journal inode is being evicted, we can end up calling into log_wait_commit() for the last transaction, after the journalling machinery has been shut down. That triggers the WARN_ONCE in __log_start_commit(). Arguably we should adjust ext3_should_journal_data() to return FALSE for the journal inode, but the only place it matters is ext3_evict_inode(), and so it's to save a bit of CPU time, and to make the patch much more obviously correct by inspection(tm), we'll fix it by explicitly not trying to waiting for a journal commit when we are evicting the journal inode, since it's guaranteed to never succeed in this case. This can be easily replicated via: mount -t ext3 -o data=journal /dev/vdb /vdb ; umount /vdb This is a port of ext4 fix from Ted Ts'o. Signed-off-by: Jan Kara <jack@suse.cz>
2013-03-11ext3: Fix format string issuesLars-Peter Clausen
ext3_msg() takes the printk prefix as the second parameter and the format string as the third parameter. Two callers of ext3_msg omit the prefix and pass the format string as the second parameter and the first parameter to the format string as the third parameter. In both cases this string comes from an arbitrary source. Which means the string may contain format string characters, which will lead to undefined and potentially harmful behavior. The issue was introduced in commit 4cf46b67eb("ext3: Unify log messages in ext3") and is fixed by this patch. CC: stable@vger.kernel.org Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jan Kara <jack@suse.cz>
2013-03-04fs: Limit sys_mount to only request filesystem modules.Eric W. Biederman
Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, ext3, udf updates from Jan Kara: "Several UDF fixes, a support for UDF extent cache, and couple of ext2 and ext3 cleanups and minor fixes" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: Ext2: remove the static function release_blocks to optimize the kernel Ext2: mark inode dirty after the function dquot_free_block_nodirty is called Ext2: remove the overhead check about sb in the function ext2_new_blocks udf: Remove unused s_extLength from udf_bitmap udf: Make s_block_bitmap standard array udf: Fix bitmap overflow on large filesystems with small block size udf: add extent cache support in case of file reading udf: Write LVID to disk after opening / closing Ext3: return ENOMEM rather than EIO if sb_getblk fails Ext2: return ENOMEM rather than EIO if sb_getblk fails Ext3: use unlikely to improve the efficiency of the kernel Ext2: use unlikely to improve the efficiency of the kernel Ext3: add necessary check in case IO error happens Ext2: free memory allocated and forget buffer head when io error happens ext3: Fix memory leak when quota options are specified multiple times ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
2013-02-23new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22block: optionally snapshot page contents to provide stable pages during writeDarrick J. Wong
This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-21Ext3: return ENOMEM rather than EIO if sb_getblk failsWang Shilong
It will be better to use ENOMEM rather than EIO, because the only reason that sb_getblk fails is that allocation fails. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz>