summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-10-28fs: Add FITRIM ioctlLukas Czerner
Adds an filesystem independent ioctl to allow implementation of file system batched discard support. I takes fstrim_range structure as an argument. fstrim_range is definec in the include/fs.h and its definition is as follows. struct fstrim_range { start; len; minlen; } start - first Byte to trim len - number of Bytes to trim from start minlen - minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This will be rounded up to fs block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: start = 0; len = ULLONG_MAX; minlen = 0; So it will trim the whole file system at one run. After the FITRIM is done, the number of actually discarded Bytes is stored in fstrim_range.len to give the user better insight on how much storage space has been really released for wear-leveling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: Use return value from sb_issue_discard()Lukas Czerner
Use return value from sb_issue_discard() as return value in ext4_issue_discard(). Since sb_issue_discard() may result in more serious errors than just -EOPNOTSUPP it is worth to inform user of this function about them to handle error cases properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: Check return value of sb_getblk() and friendsNamhyung Kim
Fail block allocation if sb_getblk() returns NULL. In that case, sb_find_get_block() also likely to fail so that it should skip calling ext4_forget(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use bio layer instead of buffer layer in mpage_da_submit_ioTheodore Ts'o
Call the block I/O layer directly instad of going through the buffer layer. This should give us much better performance and scalability, as well as lowering our CPU utilization when doing buffered writeback. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()Theodore Ts'o
This massively simplifies the ext4_da_writepages() code path by completely removing mpage_put_bnr_bhs(), which is almost 100 lines of code iterating over a set of pages using pagevec_lookup(), and folds that functionality into mpage_da_submit_io()'s existing pagevec_lookup() loop. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: inline walk_page_buffers() into mpage_da_submit_ioTheodore Ts'o
Expand the call: if (walk_page_buffers(NULL, page_bufs, 0, len, NULL, ext4_bh_delay_or_unwritten)) goto redirty_page into mpage_da_submit_io(). This will allow us to merge in mpage_put_bnr_to_bhs() in the next patch. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: inline ext4_writepage() into mpage_da_submit_io()Theodore Ts'o
As a prepratory step to switching to bio_submit, inline ext4_writepage() into mpage_da_submit() and then simplify things a bit. This makes it clearer what mpage_da_submit needs to do. Also, move the ClearPageChecked(page) call into __ext4_journalled_writepage(), as a minor bit of cleanup refactoring. This also allows us to pull i_size_read() and ext4_should_journal_data() out of the loop, which should be a very minor CPU savings. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: simplify ext4_writepage()Theodore Ts'o
The actual code in ext4_writepage() is unnecessarily convoluted. Simplify it so it is easier to understand, but otherwise logically equivalent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: call mpage_da_submit_io() from mpage_da_map_blocks()Theodore Ts'o
Eventually we need to completely reorganize the ext4 writepage callpath, but for now, we simplify things a little by calling mpage_da_submit_io() from mpage_da_map_blocks(), since all of the places where we call mpage_da_map_blocks() it is followed up by a call to mpage_da_submit_io(). We're also a wee bit better with respect to error handling, but there are still a number of issues where it's not clear what the right thing is to do with ext4 functions deep in the writeback codepath fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use KMEM_CACHE instead of kmem_cache_createTheodore Ts'o
Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem cache. This slab tends to be fairly static, so it shouldn't be marked as likely to have free pages that can be reclaimed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use search_dirblock() in ext4_dx_find_entry()Theodore Ts'o
Use the search_dirblock() in ext4_dx_find_entry(). It makes the code easier to read, and it takes advantage of common code. It also saves 100 bytes or so of text space. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Brad Spengler <spender@grsecurity.net>
2010-10-28ext4: avoid uninitialized memory references in ext3_htree_next_block()Theodore Ts'o
If the first block of htree directory is missing '.' or '..' but is otherwise a valid directory, and we do a lookup for '.' or '..', it's possible to dereference an uninitialized memory pointer in ext4_htree_next_block(). We avoid this by moving the special case from ext4_dx_find_entry() to ext4_find_entry(); this also means we can optimize ext4_find_entry() slightly when NFS looks up "..". Thanks to Brad Spengler for pointing a Clang warning that led me to look more closely at this code. The warning was harmless, but it was useful in pointing out code that was too ugly to live. This warning was also reported by Roman Borisov. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Brad Spengler <spender@grsecurity.net>
2010-10-28ext4: remove unused ext4_sb_info membersEric Sandeen
Not that these take up a lot of room, but the structure is long enough as it is, and there's no need to confuse people with these various undocumented & unused structure members... Signed-off-by: Eric Sandeen <sandeen@redaht.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: queue conversion after adding to inode's completed IO listEric Sandeen
By queuing the io end on the unwritten workqueue before adding it to our inode's list of completed IOs, I think we run the risk of the work getting completed, and the IO freed, before we try to add it to the inode's i_completed_io_list. It should be safe to add it to the inode's list of completed IOs, and -then- queue it for completion, I think. Thanks to Dave Chinner for pointing out the race. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: don't use ext4_allocation_contexts for tracingEric Sandeen
Many tracepoints were populating an ext4_allocation_context to pass in, but this requires a slab allocation even when tracepoints are off. In fact, 4 of 5 of these allocations were only for tracing. In addition, we were only using a small fraction of the 144 bytes of this structure for this purpose. We can do away with all these alloc/frees of the ac and simply pass in the bits we care about, instead. I tested this by turning on tracing and running through xfstests on x86_64. I did not actually do anything with the trace output, however. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: fix potential infinite loop in ext4_da_writepages()Toshiyuki Okajima
On linux-2.6.36-rc2, if we execute the following script, we can hang the system when the /bin/sync command is executed: ======================================================================== #!/bin/sh echo -n "HANG UP TEST: " /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null /sbin/mkfs.ext4 -Fq /tmp/img /bin/mount -o loop -t ext4 /tmp/img /mnt /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \ seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null /bin/sync /bin/umount /mnt echo "DONE" exit 0 ======================================================================== We can see the following backtrace if we get the kdump when this hangup occurs: ====================================================================== kthread() => bdi_writeback_thread() => wb_do_writeback() => wb_writeback() => writeback_inodes_wb() => writeback_sb_inodes() => writeback_single_inode() => ext4_da_writepages() ---+ ^ infinite | | loop | +-------------+ ====================================================================== The reason why this hangup happens is described as follows: 1) We write the last extent block of the file whose size is the filesystem maximum size. 2) "BH_Delay" flag is set on the buffer_head of its block. 3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX) - the member, "m_len" of struct mpage_da_data is 1 mpage_put_bnr_to_bhs() which is called via ext4_da_writepages() cannot clear "BH_Delay" flag of the buffer_head because the type of m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow. Therefore an infinite loop occurs because ext4_da_writepages() cannot write the page (which corresponds to the block) since "BH_Delay" flag isn't cleared. ---------------------------------------------------------------------- static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, struct ext4_map_blocks *map) { ... int blocks = map->m_len; ... do { // cur_logical = 4294967295 // map->m_lblk = 4294967295 // blocks = 1 // *** map->m_lblk + blocks == 0 (OVERFLOW!) *** // (cur_logical >= map->m_lblk + blocks) => true if (cur_logical >= map->m_lblk + blocks) break; ---------------------------------------------------------------------- NOTE: Mounting with the nodelalloc option will avoid this codepath, and thus, avoid this hang Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: improve llseek error handling for overly large seek offsetsToshiyuki Okajima
The llseek system call should return EINVAL if passed a seek offset which results in a write error. What this maximum offset should be depends on whether or not the huge_file file system feature is set, and whether or not the file is extent based or not. If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be written (write systemcall) is different from the maximum size which can be sought (lseek systemcall). For example, the following 2 cases demonstrates the differences between the maximum size which can be written, versus the seek offset allowed by the llseek system call: #1: mkfs.ext3 <dev>; mount -t ext4 <dev> #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev> Table. the max file size which we can write or seek at each filesystem feature tuning and file flag setting +============+===============================+===============================+ | \ File flag| | | | \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL | |case \| | | +------------+-------------------------------+-------------------------------+ | #1 | write: 2194719883264 | write: -------------- | | | seek: 2199023251456 | seek: -------------- | +------------+-------------------------------+-------------------------------+ | #2 | write: 4402345721856 | write: 17592186044415 | | | seek: 17592186044415 | seek: 17592186044415 | +------------+-------------------------------+-------------------------------+ The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes. (llseek of ext4_file_operations is generic_file_llseek which uses sb->s_maxbytes.) Therefore we create ext4 llseek function which uses 2 maxbytes. The new own function originates from generic_file_llseek(). If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
2010-10-28ext4: don't update sb journal_devnum when RO devMaciej Żenczykowski
An ext4 filesystem on a read-only device, with an external journal which is at a different device number then recorded in the superblock will fail to honor the read-only setting of the device and trigger a superblock update (write). For example: - ext4 on a software raid which is in read-only mode - external journal on a read-write device which has changed device num - attempt to mount with -o journal_dev=<new_number> - hits BUG_ON(mddev->ro = 1) in md.c Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use sb_issue_zeroout in ext4_ext_zerooutLukas Czerner
Change ext4_ext_zeroout to use sb_issue_zeroout instead of its own approach to zero out extents. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use sb_issue_zeroout in setup_new_group_blocksLukas Czerner
Use sb_issue_zeroout to zero out inode table and descriptor table blocks instead of old approach which involves journaling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: add interface to advertise ext4 features in sysfsLukas Czerner
User-space should have the opportunity to check what features doest ext4 support in each particular copy. This adds easy interface by creating new "features" directory in sys/fs/ext4/. In that directory files advertising feature names can be created. Add lazy_itable_init to the feature list. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: add support for lazy inode table initializationLukas Czerner
When the lazy_itable_init extended option is passed to mke2fs, it considerably speeds up filesystem creation because inode tables are not zeroed out. The fact that parts of the inode table are uninitialized is not a problem so long as the block group descriptors, which contain information regarding how much of the inode table has been initialized, has not been corrupted However, if the block group checksums are not valid, e2fsck must scan the entire inode table, and the the old, uninitialized data could potentially cause e2fsck to report false problems. Hence, it is important for the inode tables to be initialized as soon as possble. This commit adds this feature so that mke2fs can safely use the lazy inode table initialization feature to speed up formatting file systems. This is done via a new new kernel thread called ext4lazyinit, which is created on demand and destroyed, when it is no longer needed. There is only one thread for all ext4 filesystems in the system. When the first filesystem with inititable mount option is mounted, ext4lazyinit thread is created, then the filesystem can register its request in the request list. This thread then walks through the list of requests picking up scheduled requests and invoking ext4_init_inode_table(). Next schedule time for the request is computed by multiplying the time it took to zero out last inode table with wait multiplier, which can be set with the (init_itable=n) mount option (default is 10). We are doing this so we do not take the whole I/O bandwidth. When the thread is no longer necessary (request list is empty) it frees the appropriate structures and exits (and can be created later later by another filesystem). We do not disturb regular inode allocations in any way, it just do not care whether the inode table is, or is not zeroed. But when zeroing, we have to skip used inodes, obviously. Also we should prevent new inode allocations from the group, while zeroing is on the way. For that we take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem in the ext4_claim_inode, so when we are unlucky and allocator hits the group which is currently being zeroed, it just has to wait. This can be suppresed using the mount option no_init_itable. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28jbd2: Add sanity check for attempts to start handle during umountTheodore Ts'o
An attempt to modify the file system during the call to jbd2_destroy_journal() can lead to a system lockup. So add some checking to make it much more obvious when this happens to and to determine where the offending code is located. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: fix NULL pointer dereference in print_daily_error_info()Sergey Senozhatsky
Fix NULL pointer dereference in print_daily_error_info, when called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error reporting timer in ext4_put_super. Google-Bug-Id: 3017663 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: don't hold spinlock while calling ext4_issue_discard()Lukas Czerner
We can't hold the block group spinlock because we ext4_issue_discard() calls wait and hence can get rescheduled. Google-Bug-Id: 3017678 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: check for negative error code from sb_issue_discardLukas Czerner
sb_issue_discard() is returning negative error code, so check for -EOPNOTSUPP. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: don't bump up LONG_MAX nr_to_write by a factor of 8Eric Sandeen
I'm uneasy with lots of stuff going on in ext4_da_writepages(), but bumping nr_to_write from LLONG_MAX to -8 clearly isn't making anything better, so avoid the multiplier in that case. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: stop looping in ext4_num_dirty_pages when max_pages reachedEric Sandeen
Today we simply break out of the inner loop when we have accumulated max_pages; this keeps scanning forwad and doing pagevec_lookup_tag() in the while (!done) loop, this does potentially a lot of work with no net effect. When we have accumulated max_pages, just clean up and return. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: use dedicated slab caches for group_info structuresCurt Wohlgemuth
ext4_group_info structures are currently allocated with kmalloc(). With a typical 4K block size, these are 136 bytes each -- meaning they'll each consume a 256-byte slab object. On a system with many ext4 large partitions, that's a lot of wasted kernel slab space. (E.g., a single 1TB partition will have about 8000 block groups, using about 2MB of slab, of which nearly 1MB is wasted.) This patch creates an array of slab pointers created as needed -- depending on the superblock block size -- and uses these slabs to allocate the group info objects. Google-Bug-Id: 2980809 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28jbd2: Fix I/O hang in jbd2_journal_release_jbd_inodeBrian King
This fixes a hang seen in jbd2_journal_release_jbd_inode on a lot of Power 6 systems running with ext4. When we get in the hung state, all I/O to the disk in question gets blocked where we stay indefinitely. Looking at the task list, I can see we are stuck in jbd2_journal_release_jbd_inode waiting on a wake up. I added some debug code to detect this scenario and dump additional data if we were stuck in jbd2_journal_release_jbd_inode for longer than 30 minutes. When it hit, I was able to see that i_flags was 0, suggesting we missed the wake up. This patch changes i_flags to be an unsigned long, uses bit operators to access it, and adds barriers around the accesses. Prior to applying this patch, we were regularly hitting this hang on numerous systems in our test environment. After applying the patch, the hangs no longer occur. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28ext4: fix EOFBLOCKS_FL handlingTheodore Ts'o
It turns out we have several problems with how EOFBLOCKS_FL is handled. First of all, there was a fencepost error where we were not clearing the EOFBLOCKS_FL when fill in the last uninitialized block, but rather when we allocate the next block _after_ the uninitalized block. Secondly we were not testing to see if we needed to clear the EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting an uninitialized block (which is the most common case). Google-Bug-Id: 2928259 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28fasync: Fix placement of FASYNC flag commentLinus Torvalds
In commit f7347ce4ee7c ("fasync: re-organize fasync entry insertion to allow it under a spinlock") Arnd took an earlier patch of mine that had the comment about the FASYNC flag above the wrong function. When the fasync_add_entry() function was split to introduce the new fasync_insert_entry() helper function, the code that actually cares about the FASYNC bit moved to that new helper. So just move the comment to the right point. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28Merge branch 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: locks: turn lock_flocks into a spinlock fasync: re-organize fasync entry insertion to allow it under a spinlock locks/nfsd: allocate file lock outside of spinlock lockd: fix nlmsvc_notify_blocked locking lockd: push lock_flocks down
2010-10-28epoll: make epoll_wait() use the hrtimer range featureShawn Bohrer
This make epoll use hrtimers for the timeout value which prevents epoll_wait() from timing out up to a millisecond early. This mirrors the behavior of select() and poll(). Signed-off-by: Shawn Bohrer <shawn.bohrer@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28select: rename estimate_accuracy() to select_estimate_accuracy()Andrew Morton
Make it a subsystem-specific identifier because we wish to amke it non-static in the next patch ("epoll: make epoll_wait() use the hrtimer range feature"). Cc: Shawn Bohrer <shawn.bohrer@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28fuse: use release_pages()Miklos Szeredi
Replace iterated page_cache_release() with release_pages(), which is faster and shorter. Needs release_pages() to be exported to modules. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28exec: don't turn PF_KTHREAD off when a target command was not foundKOSAKI Motohiro
Presently do_execve() turns PF_KTHREAD off before search_binary_handler(). THis has a theorical risk of PF_KTHREAD getting lost. We don't have to turn PF_KTHREAD off in the ENOEXEC case. This patch moves this flag modification to after the finding of the executable file. This is only a theorical issue because kthreads do not call do_execve() directly. But fixing would be better. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28/proc/stat: fix scalability of irq sum of all cpuKAMEZAWA Hiroyuki
In /proc/stat, the number of per-IRQ event is shown by making a sum each irq's events on all cpus. But we can make use of kstat_irqs(). kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ, it's not a big cost. (Both of the number of cpus and irqs are small.) If a system is very big and CONFIG_GENERIC_HARDIRQ, it does for_each_irq() for_each_cpu() - look up a radix tree - read desc->irq_stat[cpu] This seems not efficient. This patch adds kstat_irqs() for CONFIG_GENRIC_HARDIRQ and change the calculation as for_each_irq() look up radix tree for_each_cpu() - read desc->irq_stat[cpu] This reduces cost. A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner) %time cat /proc/stat > /dev/null Before Patch: 2.459 sec After Patch : .561 sec [akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks] [akpm@linux-foundation.org: fix unused variable 'per_irq_sum'] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Jack Steiner <steiner@sgi.com> Acked-by: Jack Steiner <steiner@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28/proc/stat: scalability of irq num per cpuKAMEZAWA Hiroyuki
/proc/stat shows the total number of all interrupts to each cpu. But when the number of IRQs are very large, it take very long time and 'cat /proc/stat' takes more than 10 secs. This is because sum of all irq events are counted when /proc/stat is read. This patch adds "sum of all irq" counter percpu and reduce read costs. The cost of reading /proc/stat is important because it's used by major applications as 'top', 'ps', 'w', etc.... A test on a mechin (4096cpu, 256 nodes, 4592 irqs) shows %time cat /proc/stat > /dev/null Before Patch: 12.627 sec After Patch: 2.459 sec Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Jack Steiner <steiner@sgi.com> Acked-by: Jack Steiner <steiner@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28procfs: fix /proc/softirqs formattingDavidlohr Bueso
The length of the BLOCK_IPOLL string is making i's value be printed too far to the right. This patch fixes this and makes the output a bit neater. Currently: CPU0 HI: 0 TIMER: 599792 NET_TX: 2 NET_RX: 6 BLOCK: 80807 BLOCK_IOPOLL: 0 TASKLET: 20012 SCHED: 0 HRTIMER: 63 RCU: 619279 With patch: CPU0 HI: 0 TIMER: 585582 NET_TX: 2 NET_RX: 6 BLOCK: 80320 BLOCK_IOPOLL: 0 TASKLET: 19287 SCHED: 0 HRTIMER: 62 RCU: 604441 Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28/proc/pid/smaps: export amount of anonymous memory in a mappingNikanth Karthikesan
Export the number of anonymous pages in a mapping via smaps. Even the private pages in a mapping backed by a file, would be marked as anonymous, when they are modified. Export this information to user-space via smaps. Exporting this count will help gdb to make a better decision on which areas need to be dumped in its coredump; and should be useful to others studying the memory usage of a process. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28coredump: default CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=yRoland McGrath
The userland ELF tools have been coping with partial-segments core files for a few years now. Multiple distro builds are now setting this option. It behooves everyone who ever deals with core files to have more info dumped in there, especially as more and more people's compilers are producing build IDs. Make it the default. Anyone using older tools confused by these core files can configure this option off, or just change /proc/PID/coredump_filter after boot. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28core_pattern: fix truncation by core_pattern handler with long parametersXiaotian Feng
We met a parameter truncated issue, consider following: > echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \ %s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \ /proc/sys/kernel/core_pattern This is okay because the strings is less than CORENAME_MAX_SIZE. "cat /proc/sys/kernel/core_pattern" shows the whole string. but after we run core_pattern_pipe_test in man page, we found last parameter was truncated like below: argc[10]=<12807486> The root cause is core_pattern allows % specifiers, which need to be replaced during parse time, but the replace may expand the strings to larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers, the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this will write out of corename array. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28signals: move cred_guard_mutex from task_struct to signal_structKOSAKI Motohiro
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve() has no worth. Let's move ->cred_guard_mutex from task_struct to signal_struct. It naturally prevent multiple-threads-inside-exec. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28isofs: work-around for Rock Ridge+Joliet CDs with empty ISO root directoryOndrej Zary
If a CD has both Rock Ridge and Joliet extensions and the ISO root directory is empty, no files are visible. Disable Rock Ridge extensions in this case and use Joliet root directory instead. Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27quota: Fix possible oops in __dquot_initialize()Jan Kara
When quotaon(8) races with __dquot_initialize() or dqget() fails because of EIO, ENOSPC, or similar error, we could possibly dereference NULL pointer in inode->i_dquot[cnt]. Add proper checking. Reported-by: Dmitry Monakhov <dmonakhov@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-27ext3: Update kernel-doc commentsNamhyung Kim
Update missing/broken argument descriptions and fix formatting. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-27jbd/2: fixed typosAndrea Gelmini
"wakup" Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-27ext2: fixed typo.Andrea Gelmini
"excpet" Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-27ext3: Fix debug messages in ext3_group_extend()Namhyung Kim
Fix a typo, break long lines and use E3FSBLK on ext3_fsblk_t. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>