summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2007-07-19sunrpc: use vfs_path_lookupJosef 'Jeff' Sipek
use vfs_path_lookup instead of open-coding the necessary functionality. Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19fs: introduce vfs_path_lookupJosef 'Jeff' Sipek
Stackable file systems, among others, frequently need to lookup paths or path components starting from an arbitrary point in the namespace (identified by a dentry and a vfsmount). Currently, such file systems use lookup_one_len, which is frowned upon [1] as it does not pass the lookup intent along; not passing a lookup intent, for example, can trigger BUG_ON's when stacking on top of NFSv4. The first patch introduces a new lookup function to allow lookup starting from an arbitrary point in the namespace. This approach has been suggested by Christoph Hellwig [2]. The second patch changes sunrpc to use vfs_path_lookup. The third patch changes nfsctl.c to use vfs_path_lookup. The fourth patch marks link_path_walk static. The fifth, and last patch, unexports path_walk because it is no longer unnecessary to call it directly, and using the new vfs_path_lookup is cleaner. For example, the following snippet of code, looks up "some/path/component" in a directory pointed to by parent_{dentry,vfsmnt}: err = vfs_path_lookup(parent_dentry, parent_vfsmnt, "some/path/component", 0, &nd); if (!err) { /* exits */ ... /* once done, release the references */ path_release(&nd); } else if (err == -ENOENT) { /* doesn't exist */ } else { /* other error */ } VFS functions such as lookup_create can be used on the nameidata structure to pass the create intent to the file system. Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: variable length argument supportOllie Wild
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from the old mm into the new mm. We create the new mm before the binfmt code runs, and place the new stack at the very top of the address space. Once the binfmt code runs and figures out where the stack should be, we move it downwards. It is a bit peculiar in that we have one task with two mm's, one of which is inactive. [a.p.zijlstra@chello.nl: limit stack size] Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Cc: Hugh Dickins <hugh@veritas.com> [bunk@stusta.de: unexport bprm_mm_init] Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19audit: rework execve auditPeter Zijlstra
The purpose of audit_bprm() is to log the argv array to a userspace daemon at the end of the execve system call. Since user-space hasn't had time to run, this array is still in pristine state on the process' stack; so no need to copy it, we can just grab it from there. In order to minimize the damage to audit_log_*() copy each string into a temporary kernel buffer first. Currently the audit code requires that the full argument vector fits in a single packet. So currently it does clip the argv size to a (sysctl) limit, but only when execve auditing is enabled. If the audit protocol gets extended to allow for multiple packets this check can be removed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Cc: <linux-audit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19arch: personality independent stack topPeter Zijlstra
New arch macro STACK_TOP_MAX it gives the larges valid stack address for the architecture in question. It differs from STACK_TOP in that it will not distinguish between personalities but will always return the largest possible address. This is used to create the initial stack on execve, which we will move down to the proper location once the binfmt code has figured out where that is. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19use the new percpu interface for shared dataFenghua Yu
Currently most of the per cpu data, which is accessed by different cpus, has a ____cacheline_aligned_in_smp attribute. Move all this data to the new per cpu shared data section: .data.percpu.shared_aligned. This will seperate the percpu data which is referenced frequently by other cpus from the local only percpu data. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19define new percpu interface for shared dataFenghua Yu
per cpu data section contains two types of data. One set which is exclusively accessed by the local cpu and the other set which is per cpu, but also shared by remote cpus. In the current kernel, these two sets are not clearely separated out. This can potentially cause the same data cacheline shared between the two sets of data, which will result in unnecessary bouncing of the cacheline between cpus. One way to fix the problem is to cacheline align the remotely accessed per cpu data, both at the beginning and at the end. Because of the padding at both ends, this will likely cause some memory wastage and also the interface to achieve this is not clean. This patch: Moves the remotely accessed per cpu data (which is currently marked as ____cacheline_aligned_in_smp) into a different section, where all the data elements are cacheline aligned. And as such, this differentiates the local only data and remotely accessed data cleanly. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: <linux-arch@vger.kernel.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19jprobes: make jprobes a little safer for usersMichael Ellerman
I realise jprobes are a razor-blades-included type of interface, but that doesn't mean we can't try and make them safer to use. This guy I know once wrote code like this: struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" }; And then his kernel exploded. Oops. This patch adds an arch hook, arch_deref_entry_point() (I don't like it either) which takes the void * in a struct jprobe, and gives back the text address that it represents. We can then use that in register_jprobe() to check that the entry point we're passed is actually in the kernel text, rather than just some random value. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19jprobes: remove JPROBE_ENTRY()Michael Ellerman
AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything useful - so remove it .. I've left a do-nothing version so that out-of-tree jprobes code will still compile without modifications. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19jprobes: make struct jprobe.entry a void *Michael Ellerman
Currently jprobe.entry is a kprobe_opcode_t *, but that's a lie. On some platforms it doesn't point to an opcode at all, it points to a function descriptor. It's really a pointer to something that the arch code can turn into a function entry point. And that's what actually happens, none of the generic code ever looks at jprobe.entry, it's only ever dereferenced by arch code. So just make it a void *. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: sanify file_ra_state namesFengguang Wu
Rename some file_ra_state variables and remove some accessors. It results in much simpler code. Kudos to Rusty! Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: split ondemand readahead interface into two functionsRusty Russell
Split ondemand readahead interface into two functions. I think this makes it a little clearer for non-readahead experts (like Rusty). Internally they both call ondemand_readahead(), but the page argument is changed to an obvious boolean flag. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: share PG_readahead and PG_reclaimFengguang Wu
Share the same page flag bit for PG_readahead and PG_reclaim. One is used only on file reads, another is only for emergency writes. One is used mostly for fresh/young pages, another is for old pages. Combinations of possible interactions are: a) clear PG_reclaim => implicit clear of PG_readahead it will delay an asynchronous readahead into a synchronous one it actually does _good_ for readahead: the pages will be reclaimed soon, it's readahead thrashing! in this case, synchronous readahead makes more sense. b) clear PG_readahead => implicit clear of PG_reclaim one(and only one) page will not be reclaimed in time it can be avoided by checking PageWriteback(page) in readahead first c) set PG_reclaim => implicit set of PG_readahead will confuse readahead and make it restart the size rampup process it's a trivial problem, and can mostly be avoided by checking PageWriteback(page) first in readahead d) set PG_readahead => implicit set of PG_reclaim PG_readahead will never be set on already cached pages. PG_reclaim will always be cleared on dirtying a page. so not a problem. In summary, a) we get better behavior b,d) possible interactions can be avoided c) racy condition exists that might affect readahead, but the chance is _really_ low, and the hurt on readahead is trivial. Compound pages also use PG_reclaim, but for now they do not interact with reclaim/readahead code. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: pass real splice sizeFengguang Wu
Pass real splice size to page_cache_readahead_ondemand(). The splice code works in chunks of 16 pages internally. The readahead code should be told of the overall splice size, instead of the internal chunk size. Otherwize bad things may happen. Imagine some 17-page random splice reads. The code before this patch will result in two readahead calls: readahead(16); readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O and 31 readahead miss pages. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: move synchronous readahead call out of splice loopFengguang Wu
Move synchronous page_cache_readahead_ondemand() call out of splice loop. This avoids one pointless page allocation/insertion in case of non-zero ra_pages, or many pointless readahead calls in case of zero ra_pages. Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will not get expected readahead behavior anyway. The splice code works in batches of 16 pages, which can be taken as another form of synchronous readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: remove the old algorithmFengguang Wu
Remove the old readahead algorithm. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: convert ext3/ext4 invocationsFengguang Wu
Convert ext3/ext4 dir reads to use on-demand readahead. Readahead for dirs operates _not_ on file level, but on blockdev level. This makes a difference when the data blocks are not continuous. And the read routine is somehow opaque: there's no handy info about the status of current page. So a simplified call scheme is employed: to call into readahead whenever the current page falls out of readahead windows. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: convert splice invocationsFengguang Wu
Convert splice reads to use on-demand readahead. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Jens Axboe <axboe@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: convert filemap invocationsFengguang Wu
Convert filemap reads to use on-demand readahead. The new call scheme is to - call readahead on non-cached page - call readahead on look-ahead page - update prev_index when finished with the read request Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: on-demand readahead logicFengguang Wu
This is a minimal readahead algorithm that aims to replace the current one. It is more flexible and reliable, while maintaining almost the same behavior and performance. Also it is full integrated with adaptive readahead. It is designed to be called on demand: - on a missing page, to do synchronous readahead - on a lookahead page, to do asynchronous readahead In this way it eliminated the awkward workarounds for cache hit/miss, readahead thrashing, retried read, and unaligned read. It also adopts the data structure introduced by adaptive readahead, parameterizes readahead pipelining with `lookahead_index', and reduces the current/ahead windows to one single window. HEURISTICS The logic deals with four cases: - sequential-next found a consistent readahead window, so push it forward - random standalone small read, so read as is - sequential-first create a new readahead window for a sequential/oversize request - lookahead-clueless hit a lookahead page not associated with the readahead window, so create a new readahead window and ramp it up In each case, three parameters are determined: - readahead index: where the next readahead begins - readahead size: how much to readahead - lookahead size: when to do the next readahead (for pipelining) BEHAVIORS The old behaviors are maximally preserved for trivial sequential/random reads. Notable changes are: - It no longer imposes strict sequential checks. It might help some interleaved cases, and clustered random reads. It does introduce risks of a random lookahead hit triggering an unexpected readahead. But in general it is more likely to do good than to do evil. - Interleaved reads are supported in a minimal way. Their chances of being detected and proper handled are still low. - Readahead thrashings are better handled. The current readahead leads to tiny average I/O sizes, because it never turn back for the thrashed pages. They have to be fault in by do_generic_mapping_read() one by one. Whereas the on-demand readahead will redo readahead for them. OVERHEADS The new code reduced the overheads of - excessively calling the readahead routine on small sized reads (the current readahead code insists on seeing all requests) - doing a lot of pointless page-cache lookups for small cached files (the current readahead only turns itself off after 256 cache hits, unfortunately most files are < 1MB, so never see that chance) That accounts for speedup of - 0.3% on 1-page sequential reads on sparse file - 1.2% on 1-page cache hot sequential reads - 3.2% on 256-page cache hot sequential reads - 1.3% on cache hot `tar /lib` However, it does introduce one extra page-cache lookup per cache miss, which impacts random reads slightly. That's 1% overheads for 1-page random reads on sparse file. PERFORMANCE The basic benchmark setup is - 2.6.20 kernel with on-demand readahead - 1MB max readahead size - 2.9GHz Intel Core 2 CPU - 2GB memory - 160G/8M Hitachi SATA II 7200 RPM disk The benchmarks show that - it maintains the same performance for trivial sequential/random reads - sysbench/OLTP performance on MySQL gains up to 8% - performance on readahead thrashing gains up to 3 times iozone throughput (KB/s): roughly the same ========================================== iozone -c -t1 -s 4096m -r 64k 2.6.20 on-demand gain first run " Initial write " 61437.27 64521.53 +5.0% " Rewrite " 47893.02 48335.20 +0.9% " Read " 62111.84 62141.49 +0.0% " Re-read " 62242.66 62193.17 -0.1% " Reverse Read " 50031.46 49989.79 -0.1% " Stride read " 8657.61 8652.81 -0.1% " Random read " 13914.28 13898.23 -0.1% " Mixed workload " 19069.27 19033.32 -0.2% " Random write " 14849.80 14104.38 -5.0% " Pwrite " 62955.30 65701.57 +4.4% " Pread " 62209.99 62256.26 +0.1% second run " Initial write " 60810.31 66258.69 +9.0% " Rewrite " 49373.89 57833.66 +17.1% " Read " 62059.39 62251.28 +0.3% " Re-read " 62264.32 62256.82 -0.0% " Reverse Read " 49970.96 50565.72 +1.2% " Stride read " 8654.81 8638.45 -0.2% " Random read " 13901.44 13949.91 +0.3% " Mixed workload " 19041.32 19092.04 +0.3% " Random write " 14019.99 14161.72 +1.0% " Pwrite " 64121.67 68224.17 +6.4% " Pread " 62225.08 62274.28 +0.1% In summary, writes are unstable, reads are pretty close on average: access pattern 2.6.20 on-demand gain Read 62085.61 62196.38 +0.2% Re-read 62253.49 62224.99 -0.0% Reverse Read 50001.21 50277.75 +0.6% Stride read 8656.21 8645.63 -0.1% Random read 13907.86 13924.07 +0.1% Mixed workload 19055.29 19062.68 +0.0% Pread 62217.53 62265.27 +0.1% aio-stress: roughly the same ============================ aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso 2.6.20 on-demand delta sequential 92.57s 92.54s -0.0% random 311.87s 312.15s +0.1% sysbench fileio: roughly the same ================================= sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \ --file-total-size=4G --file-block-size=64K \ --num-threads=001 --max-requests=10000 --max-time=900 run threads 2.6.20 on-demand delta first run 1 59.1974s 59.2262s +0.0% 2 58.0575s 58.2269s +0.3% 4 48.0545s 47.1164s -2.0% 8 41.0684s 41.2229s +0.4% 16 35.8817s 36.4448s +1.6% 32 32.6614s 32.8240s +0.5% 64 23.7601s 24.1481s +1.6% 128 24.3719s 23.8225s -2.3% 256 23.2366s 22.0488s -5.1% second run 1 59.6720s 59.5671s -0.2% 8 41.5158s 41.9541s +1.1% 64 25.0200s 23.9634s -4.2% 256 22.5491s 20.9486s -7.1% Note that the numbers are not very stable because of the writes. The overall performance is close when we sum all seconds up: sum all up 495.046s 491.514s -0.7% sysbench oltp (trans/sec): up to 8% gain ======================================== sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \ --mysql-socket=/var/run/mysqld/mysqld.sock \ --mysql-user=root --mysql-password=readahead \ --num-threads=064 --max-requests=10000 --max-time=900 run 10000-transactions run threads 2.6.20 on-demand gain 1 62.81 64.56 +2.8% 2 67.97 70.93 +4.4% 4 81.81 85.87 +5.0% 8 94.60 97.89 +3.5% 16 99.07 104.68 +5.7% 32 95.93 104.28 +8.7% 64 96.48 103.68 +7.5% 5000-transactions run 1 48.21 48.65 +0.9% 8 68.60 70.19 +2.3% 64 70.57 74.72 +5.9% 2000-transactions run 1 37.57 38.04 +1.3% 2 38.43 38.99 +1.5% 4 45.39 46.45 +2.3% 8 51.64 52.36 +1.4% 16 54.39 55.18 +1.5% 32 52.13 54.49 +4.5% 64 54.13 54.61 +0.9% That's interesting results. Some investigations show that - MySQL is accessing the db file non-uniformly: some parts are more hot than others - It is mostly doing 4-page random reads, and sometimes doing two reads in a row, the latter one triggers a 16-page readahead. - The on-demand readahead leaves many lookahead pages (flagged PG_readahead) there. Many of them will be hit, and trigger more readahead pages. Which might save more seeks. - Naturally, the readahead windows tend to lie in hot areas, and the lookahead pages in hot areas is more likely to be hit. - The more overall read density, the more possible gain. That also explains the adaptive readahead tricks for clustered random reads. readahead thrashing: 3 times better =================================== We boot kernel with "mem=128m single", and start a 100KB/s stream on every second, until reaching 200 streams. max throughput min avg I/O size 2.6.20: 5MB/s 16KB on-demand: 15MB/s 140KB Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: data structure and routinesFengguang Wu
Extend struct file_ra_state to support the on-demand readahead logic. Also define some helpers for it. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: MIN_RA_PAGES/MAX_RA_PAGES macrosFengguang Wu
Define two convenient macros for read-ahead: - MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD - MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_ page sizes like 64k. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: add look-ahead support to __do_page_cache_readahead()Fengguang Wu
Add look-ahead support to __do_page_cache_readahead(). It works by - mark the Nth backwards page with PG_readahead, (which instructs the page's first reader to invoke readahead) - and only do the marking for newly allocated pages. (to prevent blindly doing readahead on already cached pages) Look-ahead is a technique to achieve I/O pipelining: While the application is working through a chunk of cached pages, the kernel reads-ahead the next chunk of pages _before_ time of need. It effectively hides low level I/O latencies to high level applications. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19readahead: introduce PG_readaheadFengguang Wu
Introduce a new page flag: PG_readahead. It acts as a look-ahead mark, which tells the page reader: Hey, it's time to invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until it runs out of cached pages! Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19AIO sparse fix (type of ki_flags)David Brownell
Fix type issue reported by latest 'sparse': kiocb.ki_flags should be "unsigned long" (not "long"), to match bitop type signature. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19eCryptfs: ecryptfs_setattr() bugfixMichael Halcrow
There is another bug recently introduced into the ecryptfs_setattr() function in 2.6.22. eCryptfs will attempt to treat special files like regular eCryptfs files on chmod, chown, and so forth. This leads to a NULL pointer dereference. This patch validates that the file is a regular file before proceeding with operations related to the inode's crypt_stat. Thanks to Ryusuke Konishi for finding this bug and suggesting the fix. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mbcs: Remove lots of global symbolsAlan Cox
MBCS has a collection of things that searches say are not used elsewhere and could be static. If this is the case they should be static, if not then someone at SGI should rename things like "soft_list" so they don't pollute the global namespace with generic names... Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Bruce Losure <blosure@sgi.com> Cc: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Avoid too many remote cpu references due to /proc/statRavikiran G Thirumalai
Optimize show_stat to collect per-irq information just once. On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem. On every call to kstat_irqs, the process brings in per-cpu data from all online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS results in (256+32*63) * 63 remote cpu references on a 64 cpu config. Considering the fact that we already compute this value per-cpu, we can save on the remote references as below. Signed-off-by: Alok N Kataria <alok.kataria@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19gpio calls don't need i/o barriersDavid Brownell
Clarify that drivers using the GPIO operations don't need to issue io barrier instructions themselves. Previously this wasn't clear, and at least one platform assumed otherwise (and would thus break various otherwise-portable drivers which don't issue barriers). Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19unregister_chrdev() return voidAkinobu Mita
unregister_chrdev() does not return meaningful value. This patch makes it return void like most unregister_* functions. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19unregister_chrdev(): ignore the return valueAkinobu Mita
unregister_chrdev() always returns 0. There is no need to check the return value. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19UDF: coding style conversion - lindentCyrill Gorcunov
This patch converts UDF coding style to kernel coding style using Lindent. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Suspend MAINTAINERS updatePavel Machek
I guess it is time to clarify that suspend and hibernation are separate things, and add Rafael as a maintainer. Plus, people blame us for suspend problems, anyway, I guess it is fair to mark us as suspend maintainers, too. Signed-off-by: Pavel Machek <pavel@suse.cz> Acked-by: Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: Integrate beeping flag with existing acpi_sleep flagsPavel Machek
Move "debug during resume from s2ram" into the variable we already use for real-mode flags to simplify code. It also closes nasty trap for the user in acpi_sleep_setup; order of parameters actually mattered there, acpi_sleep=s3_bios,s3_mode doing something different from acpi_sleep=s3_mode,s3_bios. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: Optional beeping during resume from suspend to RAMNigel Cunningham
Add a feature allowing the user to make the system beep during a resume from suspend to RAM, on x86_64 and i386. This is useful for the users with broken resume from RAM, so that they can verify if the control reaches the kernel after a wake-up event. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: Introduce pm_power_off_prepareRafael J. Wysocki
Introduce the pm_power_off_prepare() callback that can be registered by the interested platforms in analogy with pm_idle() and pm_power_off(), used for preparing the system to power off (needed by ACPI). This allows us to drop acpi_sysclass and device_acpi that are only defined in order to register the ACPI power off preparation callback, which is needed by pm_power_off() registered in a much different way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19ACPI: Do not prepare for hibernation in acpi_shutdownRafael J. Wysocki
Since we are now explicitly calling hibernation_ops->prepare() before hibernation_ops->enter() in hibernation_platform_enter() (defined in kernel/power/disk.c), ACPI should not call acpi_sleep_prepare(ACPI_STATE_S4) from acpi_shutdown(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: Reduce code duplication between main.c and user.cRafael J. Wysocki
The SNAPSHOT_S2RAM ioctl code is outdated and it should not duplicate the suspend code in kernel/power/main.c. Fix that. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: prevent frozen user mode helpers from failing the freezing of tasksRafael J. Wysocki
At present, if a user mode helper is running while usermodehelper_pm_callback() is executed, the helper may be frozen and the completion in call_usermodehelper_exec() won't be completed until user space processes are thawed. As a result, the freezing of kernel threads may fail, which is not desirable. Prevent this from happening by introducing a counter of running user mode helpers and allowing usermodehelper_pm_callback() to succeed for action = PM_HIBERNATION_PREPARE or action = PM_SUSPEND_PREPARE only if there are no helpers running. [Namely, usermodehelper_pm_callback() waits for at most RUNNING_HELPERS_TIMEOUT for the number of running helpers to become zero and fails if that doesn't happen.] Special thanks to Uli Luckas <u.luckas@road.de>, Pavel Machek <pavel@ucw.cz> and Oleg Nesterov <oleg@tv-sign.ru> for reviewing the previous versions of this patch and for very useful comments. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Uli Luckas <u.luckas@road.de> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: disable usermode helper before hibernation and suspendRafael J. Wysocki
Use a hibernation and suspend notifier to disable the user mode helper before a hibernation/suspend and enable it after the operation. [akpm@linux-foundation.org: build fix] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19PM: introduce hibernation and suspend notifiersRafael J. Wysocki
Make it possible to register hibernation and suspend notifiers, so that subsystems can perform hibernation-related or suspend-related operations that should not be carried out by device drivers' .suspend() and .resume() routines. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Freezer: remove redundant check in try_to_freeze_tasksRafael J. Wysocki
We don't need to check if todo is positive before calling time_after() in try_to_freeze_tasks(), because if todo is zero at this point, the loop will be broken anyway due to the while () condition being false. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Freezer: return int from freeze_processesRafael J. Wysocki
Make try_to_freeze_tasks() and freeze_processes() return -EBUSY on failure instead of the number of unfrozen tasks (none of the callers actually uses this number). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Freezer: use __set_current_state in refrigeratorRafael J. Wysocki
Use __set_current_state() as appropriate in refrigerator() instead of accessing current->state directly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Freezer: avoid freezing kernel threads prematurelyRafael J. Wysocki
Kernel threads should not have TIF_FREEZE set when user space processes are being frozen, since otherwise some of them might be frozen prematurely. To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks() check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags when user space processes are to be frozen. Namely, when user space processes are being frozen, we only should set TIF_FREEZE for tasks that have p->mm different from NULL and don't have PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we need to prevent the following scenario from happening: * daemonize() is called by a task spawned from a user space code path * freezer checks if the task has p->mm set and the result is positive * task enters exit_mm() and clears its TIF_FREEZE * freezer sets TIF_FREEZE for the task * task calls try_to_freeze() and goes to the refrigerator, which is wrong at that point This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and p->mm are examined and release it after TIF_FREEZE is set for p (or it turns out that TIF_FREEZE should not be set). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19Hibernation: prepare to enter the low power stateRafael J. Wysocki
During hibernation we call hibernation_ops->prepare() before creating the image, but then, before saving it, we cancel the power transition by calling hibernation_ops->finish(). Thus prior to calling hibernation_ops->enter() we should let the platform firmware know that we're going to enter the low power state after all. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19swsusp: fix hibernation code orderingRafael J. Wysocki
Change the code ordering so that hibernation_ops->prepare() is called after device_suspend(). This is needed so that we don't violate the ACPI specification, which states that the _PTS and _GTS system-control methods, executed from acpi_sleep_prepare(), ought to be called after devices have been put in low power states. The "Finish" label in hibernation_restore() is moved, because device_suspend() resumes devices if the suspending of them fails and the restore code ordering should reflect the hibernation code ordering. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19swsusp: introduce restore platform operationsRafael J. Wysocki
At least on some machines it is necessary to prepare the ACPI firmware for the restoration of the system memory state from the hibernation image if the "platform" mode of hibernation has been used. Namely, in that cases we need to disable the GPEs before replacing the "boot" kernel with the "frozen" kernel (cf. http://bugzilla.kernel.org/show_bug.cgi?id=7887). After the restore they will be re-enabled by hibernation_ops->finish(), but if the restore fails, they have to be re-enabled by the restore code explicitly. For this purpose we can introduce two additional hibernation operations, called pre_restore() and restore_cleanup() and call them from the restore code path. Still, they should be called if the "platform" mode of hibernation has been used, so we need to pass the information about the hibernation mode from the "frozen" kernel to the "boot" kernel in the image header. Apparently, we can't drop the disabling of GPEs before the restore because of Bug #7887 .  We also can't do it unconditionally, because the GPEs wouldn't have been enabled after a successful restore if the suspend had been done in the 'shutdown' or 'reboot' mode. In principle we could (and probably should) unconditionally disable the GPEs before each snapshot creation *and* before the restore, but then we'd have to unconditionally enable them after the snapshot creation as well as after the restore (or restore failure)   Still, for this purpose we'd need to modify acpi_enter_sleep_state_prep() and acpi_leave_sleep_state() and we'd have to introduce some mechanism synchronizing the disablind/enabling of the GPEs with the device drivers' .suspend()/.resume() routines and with disable_/enable_nonboot_cpus().  However, this would have affected the suspend (ie. s2ram) code as well as the hibernation, which I'd like to avoid in this patch series. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19swsusp: remove code duplication between disk.c and user.cRafael J. Wysocki
Currently, much of the code in kernel/power/disk.c is duplicated in kernel/power/user.c , mainly for historical reasons. By eliminating this code duplication we can reduce the size of user.c quite substantially and remove the maintenance difficulty resulting from it. [bunk@stusta.de: kernel/power/disk.c: make code static] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19swsusp: remove incorrect code from user.cRafael J. Wysocki
In the face of the recent change of suspend code ordering (cf. http://marc.info/?l=linux-acpi&m=117938245931603&w=2) we should also modify the code ordering in swsusp so that hibernation_ops->prepare() is executed after device_suspend(). However, for this purpose it seems reasonable to eliminate the code duplication between kernel/power/disk.c and kernel/power/user.c first. By eliminating it we can reduce the size of user.c quite substantially and remove the maintenance difficulty with making essentially the same changes in two different places. Moreover, we should also remove the calls to "platform" functions from the restore code path, since it doesn't carry out any power transition of the system, but we generally need to disable the GPEs before the restore if the 'platform' hibernation mode has been used. To do this, we can introduce two new hibernation_ops to be used in the restore code. This patch: Make the code hibernation code in kernel/power/user.c be functionally equivalent to the corresponding code in kernel/power/disk.c , as it should be. The calls to the platform functions removed by this patch are incorrect. They should be replaced with some other "platform" invocations that will be introduced in one of the subsequent patches. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>