summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2009-09-24vfs: remove redundant position check in do_sendfileJeff Layton
As Johannes Weiner pointed out, one of the range checks in do_sendfile is redundant and is already checked in rw_verify_area. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Robert Love <rlove@google.com> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24vfs: change sb->s_maxbytes to a loff_tJeff Layton
sb->s_maxbytes is supposed to indicate the maximum size of a file that can exist on the filesystem. It's declared as an unsigned long long. Even if a filesystem has no inherent limit that prevents it from using every bit in that unsigned long long, it's still problematic to set it to anything larger than MAX_LFS_FILESIZE. There are places in the kernel that cast s_maxbytes to a signed value. If it's set too large then this cast makes it a negative number and generally breaks the comparison. Change s_maxbytes to be loff_t instead. That should help eliminate the temptation to set it too large by making it a signed value. Also, add a warning for couple of releases to help catch filesystems that set s_maxbytes too large. Eventually we can either convert this to a BUG() or just remove it and in the hope that no one will get it wrong now that it's a signed value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Robert Love <rlove@google.com> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24vfs: explicitly cast s_maxbytes in fiemap_check_rangesJeff Layton
If fiemap_check_ranges is passed a large enough value, then it's possible that the value would be cast to a signed value for comparison against s_maxbytes when we change it to loff_t. Make sure that doesn't happen by explicitly casting s_maxbytes to an unsigned value for the purposes of comparison. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Love <rlove@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24libfs: return error code on failed attr setWu Fengguang
Currently all simple_attr.set handlers return 0 on success and negative codes on error. Fix simple_attr_write() to return these error codes. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24seq_file: return a negative error code when seq_path_root() fails.Tetsuo Handa
seq_path_root() is returning a return value of successful __d_path() instead of returning a negative value when mangle_path() failed. This is not a bug so far because nobody is using return value of seq_path_root(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24vfs: optimize touch_time() tooAndi Kleen
Do a similar optimization as earlier for touch_atime. Getting the lock in mnt_get_write is relatively costly, so try all avenues to avoid it first. This patch is careful to still only update inode fields inside the lock region. This didn't show up in benchmarks, but it's easy enough to do. [akpm@linux-foundation.org: fix typo in comment] [hugh.dickins@tiscali.co.uk: fix inverted test of mnt_want_write_file()] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Valerie Aurora <vaurora@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24vfs: optimization for touch_atime()Andi Kleen
Some benchmark testing shows touch_atime to be high up in profile logs for IO intensive workloads. Most likely that's due to the lock in mnt_want_write(). Unfortunately touch_atime first takes the lock, and then does all the other tests that could avoid atime updates (like noatime or relatime). Do it the other way round -- first try to avoid the update and only then if that didn't succeed take the lock. That works because none of the atime avoidance tests rely on locking. This also eliminates a goto. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Reviewed-by: Valerie Aurora <vaurora@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24vfs: split generic_forget_inode() so that hugetlbfs does not have to copy itJan Kara
Hugetlbfs needs to do special things instead of truncate_inode_pages(). Currently, it copied generic_forget_inode() except for truncate_inode_pages() call which is asking for trouble (the code there isn't trivial). So create a separate function generic_detach_inode() which does all the list magic done in generic_forget_inode() and call it from hugetlbfs_forget_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24fs/inode.c: add dev-id and inode number for debugging in init_special_inode()Manish Katiyar
Add device-id and inode number for better debugging. This was suggested by Andreas in one of the threads http://article.gmane.org/gmane.comp.file-systems.ext4/12062 . "If anyone has a chance, fixing this error message to be not-useless would be good... Including the device name and the inode number would help track down the source of the problem." Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24libfs: make simple_read_from_buffer conventionalSteven Rostedt
Impact: have simple_read_from_buffer conform to standards It was brought to my attention by Andrew Morton, Theodore Tso, and H. Peter Anvin that a read from userspace should only return -EFAULT if nothing was actually read. Looking at the simple_read_from_buffer I noticed that this function does not conform to that rule. This patch fixes that function. [akpm@linux-foundation.org: simplification suggested by hpa] [hpa@zytor.com: fix count==0 handling] Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24headers: utsname.h reduxAlexey Dobriyan
* remove asm/atomic.h inclusion from linux/utsname.h -- not needed after kref conversion * remove linux/utsname.h inclusion from files which do not need it NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however due to some personality stuff it _is_ needed -- cowardly leave ELF-related headers and files alone. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Propagate 'fsc' mount option through automounts sunrpc/rpc_pipe: fix kernel-doc notation sunrpc: xdr_xcode_hyper helpers cannot presume 64-bit alignment NFS: Add nfs_alloc_parsed_mount_data NFS/RPC: fix problems with reestablish_timeout and related code. NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flags
2009-09-23NFS: Propagate 'fsc' mount option through automountsDavid Howells
Propagate the NFS 'fsc' mount option through NFS automounts of various types. This is now required as commit: commit c02d7adf8c5429727a98bad1d039bccad4c61c50 Author: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Mon Jun 22 15:09:14 2009 -0400 NFSv4: Replace nfs4_path_walk() with VFS path lookup in a private namespace uses VFS-driven automounting to reach all submounts barring the root, thus preventing fscaching from being enabled on any submount other than the root. This patch gets around that by propagating the NFS_OPTION_FSCACHE flag across automounts. If a uniquifier is supplied to a mount then this is propagated to all automounts of that mount too. Signed-off-by: David Howells <dhowells@redhat.com> [Trond: Fixed up the definition of nfs_fscache_get_super_cookie for the case of #undef CONFIG_NFS_FSCACHE] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-23NFS: Add nfs_alloc_parsed_mount_dataChuck Lever
Allocating nfs_parsed_mount_data and setting up the defaults is nearly the same for both nfs and nfs4 mounts. Both paths seem to use nfs_validate_transport_protocol(), so setting a default value for nfs_server.protocol ought to be unnecessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-23NFS: Get rid of the NFS_MOUNT_VER3 and NFS_MOUNT_TCP flagsTrond Myklebust
Keep it in the case of the legacy binary mount interface, but purge it from the nfs_server structure. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-09-239p: Add fscache support to 9pAbhishek Kulkarni
This patch adds a persistent, read-only caching facility for 9p clients using the FS-Cache caching backend. When the fscache facility is enabled, each inode is associated with a corresponding vcookie which is an index into the FS-Cache indexing tree. The FS-Cache indexing tree is indexed at 3 levels: - session object associated with each mount. - inode/vcookie - actual data (pages) A cache tag is chosen randomly for each session. These tags can be read off /sys/fs/9p/caches and can be passed as a mount-time parameter to re-attach to the specified caching session. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-09-239p: Fix the incorrect update of inode size in v9fs_file_write()Abhishek Kulkarni
When using the cache=loose flags, the inode's size was not being updated correctly on a remote write. Thus subsequent reads of the whole file resulted in a truncated read. Fix it. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-09-239p: Use the i_size_[read, write]() macros instead of using inode->i_size ↵Abhishek Kulkarni
directly. Change all occurrence of inode->i_size with i_size_read() or i_size_write() as appropriate. Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-09-23Merge git://git.infradead.org/mtd-2.6Linus Torvalds
* git://git.infradead.org/mtd-2.6: (58 commits) mtd: jedec_probe: add PSD4256G6V id mtd: OneNand support for Nomadik 8815 SoC (on NHK8815 board) mtd: nand: driver for Nomadik 8815 SoC (on NHK8815 board) m25p80: Add Spansion S25FL129P serial flashes jffs2: Use SLAB_HWCACHE_ALIGN for jffs2_raw_{dirent,inode} slabs mtd: sh_flctl: register sh_flctl using platform_driver_probe() mtd: nand: txx9ndfmc: transfer 512 byte at a time if possible mtd: nand: fix tmio_nand ecc correction mtd: nand: add __nand_correct_data helper function mtd: cfi_cmdset_0002: add 0xFF intolerance for M29W128G mtd: inftl: fix fold chain block number mtd: jedec: fix compilation problem with I28F640C3B definition mtd: nand: fix ECC Correction bug for SMC ordering for NDFC driver mtd: ofpart: Check availability of reg property instead of name property driver/Makefile: Initialize "mtd" and "spi" before "net" mtd: omap: adding DMA mode support in nand prefetch/post-write mtd: omap: add support for nand prefetch-read and post-write mtd: add nand support for w90p910 (v2) mtd: maps: add mtd-ram support to physmap_of mtd: pxa3xx_nand: add single-bit error corrections reporting ...
2009-09-23Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (85 commits) ocfs2: Use buffer IO if we are appending a file. ocfs2: add spinlock protection when dealing with lockres->purge. dlmglue.c: add missed mlog lines ocfs2: __ocfs2_abort() should not enable panic for local mounts ocfs2: Add ioctl for reflink. ocfs2: Enable refcount tree support. ocfs2: Implement ocfs2_reflink. ocfs2: Add preserve to reflink. ocfs2: Create reflinked file in orphan dir. ocfs2: Use proper parameter for some inode operation. ocfs2: Make transaction extend more efficient. ocfs2: Don't merge in 1st refcount ops of reflink. ocfs2: Modify removing xattr process for refcount. ocfs2: Add reflink support for xattr. ocfs2: Create an xattr indexed block if needed. ocfs2: Call refcount tree remove process properly. ocfs2: Attach xattr clusters to refcount tree. ocfs2: Abstract ocfs2 xattr tree extend rec iteration process. ocfs2: Abstract the creation of xattr block. ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value. ...
2009-09-23fs: change sys_truncate length parameter typeHeiko Carstens
For this system call user space passes a signed long length parameter, while the kernel side takes an unsigned long parameter and converts it later to signed long again. This has led to bugs in compat wrappers see e.g. dd90bbd5 "powerpc: Add compat_sys_truncate". The s390 compat wrapper for this functions is broken as well since it also performs zero extension instead of sign extension for the length parameter. In addition if hpa comes up with an automated way of generating compat wrappers it would generate a wrong one here. So change the length parameter from unsigned long to long. Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ext2: fix format string compile warning (ino_t)Heiko Carstens
Unlike on most other architectures ino_t is an unsigned int on s390. So add an explicit cast to avoid this compile warning: fs/ext2/namei.c: In function 'ext2_lookup': fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t' Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23V3 minixfs: add missing directory type checkingDoug Graham
There are a few places in the Minix FS code where the "inode" field of a minix_dir_entry is used without checking first to see if the dirent is really a minix3_dir_entry. The inode number in a V1/V2 dirent is 16 bits, whereas that in a V3 dirent is 32 bits. Accessing it as a 16 bit field when it really should be accessed as a 32 bit field probably kinda sorta works on a little-endian machine, but leads to some rather odd behaviour on big-endian machines. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Doug Graham <dgraham@nortel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ncpfs: fix wrong check in __ncp_ioctl()Bartlomiej Zolnierkiewicz
We want to check for s_inode's existence, not inode's one (inode is always valid in this function). This takes care of the following entry from Dan's list: fs/ncpfs/ioctl.c +445 __ncp_ioctl(180) warning: variable derefenced before check 'inode' Reported-by: Dan Carpenter <error27@gmail.com> Cc: Julia Lawall <julia@diku.dk> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Petr Vandrovec <vandrove@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ncpfs: read buffer overflowRoel Kluin
This function uses signed integers for the unix_date and local variables - if a negative number is supplied and the leap-year condition is not met, month will be 0, leading to a later read of day_n[-1] Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ramfs: move RAMFS_MAGIC to include/linux/magic.hmaximilian attems
initramfs userspace likes to use this magic number. Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: maximilian attems <max@stro.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23/proc/kcore: update stat.st_size after memory hotplugKAMEZAWA Hiroyuki
After memory hotplug (or other events in future), kcore size can be modified. To update inode->i_size, we have to know inode/dentry but we can't get it from inside /proc directly. But considerinyg memory hotplug, kcore image is updated only when it's opened. Then, updating inode->i_size at open() is enough. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23/proc/kcore: fix stat.st_sizeKAMEZAWA Hiroyuki
Presently the size of /proc/kcore which can be read by 'ls -l' is 0. But it's not the correct value. On x86-64, ls -l shows ... root root 140737486266368 2009-09-17 10:29 /proc/kcore Then, 7FFFFFFE02000. This comes from vmalloc area's size. (*) This shows "core" size, not memory size. This patch shows the size by updating "size" field in struct proc_dir_entry. Later, lookup routine will create inode and fill inode->i_size based on this value. Then, this has a problem. - Once inode is cached, inode->i_size will never be updated. Then, this patch is not memory-hotplug-aware. To update inode->i_size, we have to know dentry or inode. But there is no way to lookup them by inside kernel. Hmmm.... Next patch will try it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: more fixes for initKAMEZAWA Hiroyuki
proc_kcore_init() doesn't check NULL case. fix it and remove unnecessary comments. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register module area in generic wayKAMEZAWA Hiroyuki
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area. This is handled only in x86-64. This patch make it more generic. And we can use vread/vwrite to access the area. Fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register vmemmap rangeKAMEZAWA Hiroyuki
Benjamin Herrenschmidt <benh@kernel.crashing.org> pointed out that vmemmap range is not included in KCORE_RAM, KCORE_VMALLOC .... This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. By this, vmemmap can be readable via /proc/kcore Because it's not vmalloc area, vread/vwrite cannot be used. But the range is static against the memory layout, this patch handles vmemmap area by the same scheme with physical memory. This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range. It's correct now. [akpm@linux-foundation.org: fix typo] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: use registerd physmem informationKAMEZAWA Hiroyuki
For /proc/kcore, each arch registers its memory range by kclist_add(). In usual, - range of physical memory - range of vmalloc area - text, etc... are registered but "range of physical memory" has some troubles. It doesn't updated at memory hotplug and it tend to include unnecessary memory holes. Now, /proc/iomem (kernel/resource.c) includes required physical memory range information and it's properly updated at memory hotplug. Then, it's good to avoid using its own code(duplicating information) and to rebuild kclist for physical memory based on /proc/iomem. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register text area in generic wayKAMEZAWA Hiroyuki
Some 64bit arch has special segment for mapping kernel text. It should be entried to /proc/kcore in addtion to direct-linear-map, vmalloc area. This patch unifies KCORE_TEXT entry scattered under x86 and ia64. I'm not familiar with other archs (mips has its own even after this patch) but range of [_stext ..._end) is a valid area of text and it's not in direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is only a necessary thing to do. Note: I left mips as it is now. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register vmalloc area in generic wayKAMEZAWA Hiroyuki
For /proc/kcore, vmalloc areas are registered per arch. But, all of them registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies them. By this. archs which have no kclist_add() hooks can see vmalloc area correctly. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: add kclist typesKAMEZAWA Hiroyuki
Presently, kclist_add() only eats start address and size as its arguments. Considering to make kclist dynamically reconfigulable, it's necessary to know which kclists are for System RAM and which are not. This patch add kclist types as KCORE_RAM KCORE_VMALLOC KCORE_TEXT KCORE_OTHER This "type" is used in a patch following this for detecting KCORE_RAM. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: use usual list for kclistKAMEZAWA Hiroyuki
This patchset is for /proc/kcore. With this, - many per-arch hooks are removed. - /proc/kcore will know really valid physical memory area. - /proc/kcore will be aware of memory hotplug. - /proc/kcore will be architecture independent i.e. if an arch supports CONFIG_MMU, it can use /proc/kcore. (if the arch uses usual memory layout.) This patch: /proc/kcore uses its own list handling codes. It's better to use generic list codes. No changes in logic. just clean up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23procfs: provide stack information for threadsStefani Seibold
A patch to give a better overview of the userland application stack usage, especially for embedded linux. Currently you are only able to dump the main process/thread stack usage which is showed in /proc/pid/status by the "VmStk" Value. But you get no information about the consumed stack memory of the the threads. There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks the vm mapping where the thread stack pointer reside with "[thread stack xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value information, because libpthread doesn't set the start of the stack to the top of the mapped area, depending of the pthread usage. A sample output of /proc/<pid>/task/<tid>/maps looks like: 08048000-08049000 r-xp 00000000 03:00 8312 /opt/z 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] a7d12000-a7d13000 ---p 00000000 00:00 0 a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4] a7f13000-a7f14000 ---p 00000000 00:00 0 a7f14000-a7f36000 rw-p 00000000 00:00 0 a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6 a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6 a806c000-a806f000 rw-p 00000000 00:00 0 a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 a8085000-a8088000 rw-p 00000000 00:00 0 a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack] ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status which will you give the current stack usage in kb. A sample output of /proc/self/status looks like: Name: cat State: R (running) Tgid: 507 Pid: 507 . . . CapBnd: fffffffffffffeff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 0 Stack usage: 12 kB I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base address of the associated thread stack and not the one of the main process. This makes more sense. [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()] Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/proc/base.c: fix proc_fault_inject_write() input sanity checkVincent Li
Remove obfuscated zero-length input check and return -EINVAL instead of -EIO error to make the error message clear to user. Add whitespace stripping. No functionality changes. The old code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error) The new code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument) This patch is conservative in changes to not breaking existing scripts/applications. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity checkVincent Li
Andrew Morton pointed out similar string hacking and obfuscated check for zero-length input at the end of the function, David Rientjes suggested to use strict_strtol to replace simple_strtol, this patch cover above suggestions, add removing of leading and trailing whitespace from user input. It does not change function behavious. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Acked-by: David Rientjes <rientjes@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Amerigo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: fix /proc/kcore's stat.st_sizeAmerigo Wang
In 9063c61fd5cbd ("x86, 64-bit: Clean up user address masking") Linus fixed the wrong size of /proc/kcore problem. But its size still looks insane, since it never equals the size of physical memory. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tao Ma <tao.ma@oracle.com> Cc: <mtk.manpages@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23proc_flush_task: flush /proc/tid/task/pid when a sub-thread exitsOleg Nesterov
The exiting sub-thread flushes /proc/pid only, but this doesn't buy too much: ps and friends mostly use /proc/tid/task/pid. Remove "if (thread_group_leader())" checks from proc_flush_task() path, this means we always remove /proc/tid/task/pid dentry on exit, and this actually matches the comment above proc_flush_task(). The test-case: static void* tfunc(void *arg) { char name[256]; sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid()); close(open(name, O_RDONLY)); return NULL; } int main(void) { pthread_t t; for (;;) { if (!pthread_create(&t, NULL, &tfunc, NULL)) pthread_join(t, NULL); } } slabtop shows that pid/proc_inode_cache/etc grow quickly and "indefinitely" until the task is killed or shrink_slab() is called, not good. And the main thread needs a lot of time to exit. The same can happen if something like "ps -efL" runs continuously, while some application spawns short-living threads. Reported-by: "James M. Leddy" <jleddy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dominic Duval <dduval@redhat.com> Cc: Frank Hirtz <fhirtz@redhat.com> Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Paul Batkowski <pbatkowski@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23proc: fix reported unit for RLIMIT_CPUKees Cook
/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit used in kernel/posix-cpu-timers.c: unsigned long psecs = cputime_to_secs(ptime); ... if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) { ... __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk); Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23getrusage: fill ru_maxrss valueJiri Pirko
Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater mark. This struct is filled as a parameter to getrusage syscall. ->ru_maxrss value is set to KBs which is the way it is done in BSD systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs which seems to be incorrect behavior. Maintainer of this util was notified by me with the patch which corrects it and cc'ed. To make this happen we extend struct signal_struct by two fields. The first one is ->maxrss which we use to store rss hiwater of the task. The second one is ->cmaxrss which we use to store highest rss hiwater of all task childs. These values are used in k_getrusage() to actually fill ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm struct exists. Note: exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss. it is intetionally behavior. *BSD getrusage have exec() inheriting. test programs ======================================================== getrusage.c =========== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) int main(int argc, char** argv) { int status; printf("allocate 100MB\n"); consume(100); printf("testcase1: fork inherit? \n"); printf(" expect: initial.self ~= child.self\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase2: fork inherit? (cont.) \n"); printf(" expect: initial.children ~= 100MB, but child.children = 0\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("child"); _exit(0); } printf("\n"); printf("testcase3: fork + malloc \n"); printf(" expect: child.self ~= initial.self + 50MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { printf("allocate +50MB\n"); consume(50); show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase4: grandchild maxrss\n"); printf(" expect: post_wait.children ~= 300MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); show_rusage("post_wait"); } else { system("./child -n 0 -g 300"); _exit(0); } printf("\n"); printf("testcase5: zombie\n"); printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n"); printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n"); show_rusage("initial"); if (__fork()) { sleep(1); /* children become zombie */ show_rusage("pre_wait"); wait(&status); show_rusage("post_wait"); } else { system("./child -n 400"); _exit(0); } printf("\n"); printf("testcase6: SIG_IGN\n"); printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n"); show_rusage("initial"); signal(SIGCHLD, SIG_IGN); if (__fork()) { sleep(1); /* children become zombie */ show_rusage("after_zombie"); } else { system("./child -n 500"); _exit(0); } printf("\n"); signal(SIGCHLD, SIG_DFL); printf("testcase7: exec (without fork) \n"); printf(" expect: initial ~= exec \n"); show_rusage("initial"); execl("./child", "child", "-v", NULL); return 0; } child.c ======= #include <sys/types.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include "common.h" int main(int argc, char** argv) { int status; int c; long consume_size = 0; long grandchild_consume_size = 0; int show = 0; while ((c = getopt(argc, argv, "n:g:v")) != -1) { switch (c) { case 'n': consume_size = atol(optarg); break; case 'v': show = 1; break; case 'g': grandchild_consume_size = atol(optarg); break; default: break; } } if (show) show_rusage("exec"); if (consume_size) { printf("child alloc %ldMB\n", consume_size); consume(consume_size); } if (grandchild_consume_size) { if (fork()) { wait(&status); } else { printf("grandchild alloc %ldMB\n", grandchild_consume_size); consume(grandchild_consume_size); exit(0); } } return 0; } common.c ======== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) void show_rusage(char *prefix) { int err, err2; struct rusage rusage_self; struct rusage rusage_children; printf("%s: ", prefix); err = getrusage(RUSAGE_SELF, &rusage_self); if (!err) printf("self %ld ", rusage_self.ru_maxrss); err2 = getrusage(RUSAGE_CHILDREN, &rusage_children); if (!err2) printf("children %ld ", rusage_children.ru_maxrss); printf("\n"); } /* Some buggy OS need this worthless CPU waste. */ void make_pagefault(void) { void *addr; int size = getpagesize(); int i; for (i=0; i<1000; i++) { addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (addr == MAP_FAILED) err("make_pagefault"); memset(addr, 0, size); munmap(addr, size); } } void consume(int mega) { size_t sz = mega * 1024 * 1024; void *ptr; ptr = malloc(sz); memset(ptr, 0, sz); make_pagefault(); } pid_t __fork(void) { pid_t pid; pid = fork(); make_pagefault(); return pid; } common.h ======== void show_rusage(char *prefix); void make_pagefault(void); void consume(int mega); pid_t __fork(void); FreeBSD result (expected result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 103492 children 0 fork child: self 103540 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 103540 children 103540 child: self 103564 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 103564 children 103564 allocate +50MB fork child: self 154860 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 103564 children 154860 grandchild alloc 300MB post_wait: self 103564 children 308720 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 103564 children 308720 child alloc 400MB pre_wait: self 103564 children 308720 post_wait: self 103564 children 411312 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 103564 children 411312 child alloc 500MB after_zombie: self 103624 children 411312 testcase7: exec (without fork) expect: initial ~= exec initial: self 103624 children 411312 exec: self 103624 children 411312 Linux result (actual test result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 102848 children 0 fork child: self 102572 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 102876 children 102644 child: self 102572 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 102876 children 102644 allocate +50MB fork child: self 153804 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 102876 children 153864 grandchild alloc 300MB post_wait: self 102876 children 307536 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 102876 children 307536 child alloc 400MB pre_wait: self 102876 children 307536 post_wait: self 102876 children 410076 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 102876 children 410076 child alloc 500MB after_zombie: self 102880 children 410076 testcase7: exec (without fork) expect: initial ~= exec initial: self 102880 children 410076 exec: self 102880 children 410076 Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fix compat_sys_utimensat()Suzuki Poulose
Compat utimensat() returns EINVAL when the tv_nsec is one of UTIME_OMIT or UTIME_NOW and the tv_sec is set to non-zero. As per man pages, the tv_sec field should be ignored. sys_utimensat() works fine in this case. Test case: #define _GNU_SOURCE #define _ATFILE_SOURCE #include <stdio.h> #include <fcntl.h> #include <unistd.h> #include <sys/stat.h> #include <stdlib.h> main(int argc, char *argv[]) { struct timespec ts[2]; struct timespec *tsp; if (argc < 2) { fprintf(stderr, "Usage : %s filename\n", argv[0]); exit (-1); } ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW; ts[0].tv_sec = ts[1].tv_sec = 1; tsp = ts; if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1) perror("utimensat"); else fprintf(stdout, "utimensat success\n"); return 0; } mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64 mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32 mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat: Invalid argument mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8 With the patch : mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8utimensat Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23qnx4: remove write supportChristoph Hellwig
qnx4 wrte support has never been fully implement, is broken since the dawn of time and hasn't been actively developed since before git history started. Instead of letting it further bitrot and complicate API transition (like the new truncate code) remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Anders Larsen <al@alarsen.net> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ntfs: remove ntfs_file_writeChristoph Hellwig
do_sync_write() does the right thing for turning the aio_writev method into a normal non-vectored synchronous write, no need to duplicate it in ntfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23anonfd: split interface into file creation and installDavide Libenzi
Split the anonfd interface into a bare file pointer creation one, and a file pointer creation plus install one. There are cases, like the usage of eventfds inside other kernel interfaces, where the file pointer created by anonfd needs to be used inside the initialization of other structures. As it is right now, as soon as anon_inode_getfd() returns, the kenrle can race with userspace closing the newly installed file descriptor. This patch, while keeping the old anon_inode_getfd(), introduces a new anon_inode_getfile() (whose services are reused in anon_inode_getfd()) that allows to split the file creation phase and the fd install one. Once all the kernel structures are initialized, the code can call the proper fd_install(). Gregory manifested the need for something like this inside KVM. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23aio.c: move EXPORT* macros to line after functionH Hartley Sweeten
As mentioned in Documentation/CodingStyle, move EXPORT* macro's to the line immediately after the closing function brace line. Also, move the __initcall() similarly. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/buffer.c: clean up EXPORT* macrosH Hartley Sweeten
According to Documentation/CodingStyle the EXPORT* macro should follow immediately after the closing function brace line. Also, mark_buffer_async_write_endio() and do_thaw_all() are not used elsewhere so they should be marked as static. In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to that file. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs: turn iprune_mutex into rwsemNick Piggin
We have had a report of bad memory allocation latency during DVD-RAM (UDF) writing. This is causing the user's desktop session to become unusable. Jan tracked the cause of this down to UDF inode reclaim blocking: gnome-screens D ffff810006d1d598 0 20686 1 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0 Call Trace: [<ffffffff804477f3>] io_schedule+0x63/0xa5 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21 [<ffffffff802c442a>] __bread+0x70/0x86 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b [<ffffffff802b3ab1>] dispose_list+0x5b/0x102 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189 [<ffffffff802822d8>] __do_fault+0x10c/0x417 [<ffffffff80284232>] handle_mm_fault+0x466/0x940 [<ffffffff8044b922>] do_page_fault+0x676/0xabf This blocks with iprune_mutex held, which then blocks other reclaimers: X D ffff81009d47c400 0 17285 14831 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740 Call Trace: [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73 [<ffffffff802ad949>] do_select+0x308/0x506 [<ffffffff802adced>] core_sys_select+0x1a6/0x254 [<ffffffff802ae0b7>] sys_select+0xb5/0x157 Now I think the main problem is having the filesystem block (and do IO) in inode reclaim. The problem is that this doesn't get accounted well and penalizes a random allocator with a big latency spike caused by work generated from elsewhere. I think the best idea would be to avoid this. By design if possible, or by deferring the hard work to an asynchronous context. If the latter, then the fs would probably want to throttle creation of new work with queue size of the deferred work, but let's not get into those details. Anyway, the other obvious thing we looked at is the iprune_mutex which is causing the cascading blocking. We could turn this into an rwsem to improve concurrency. It is unreasonable to totally ban all potentially slow or blocking operations in inode reclaim, so I think this is a cheap way to get a small improvement. This doesn't solve the whole problem of course. The process doing inode reclaim will still take the latency hit, and concurrent processes may end up contending on filesystem locks. So fs developers should keep these problems in mind. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>