summaryrefslogtreecommitdiff
path: root/mm/shmem.c
AgeCommit message (Collapse)Author
2008-04-28mempolicy: rework shmem mpol parsing and displayLee Schermerhorn
mm/shmem.c currently contains functions to parse and display memory policy strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with the rest of the mempolicy support. With subsequent patches, we'll be able to remove knowledge of the details [mode, flags, policy, ...] completely from shmem.c 1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in mm/mempolicy.c. Rework to use the policy_types[] array [used by mpol_to_str()] to look up mode by name. 2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str() expects a pointer to a struct mempolicy, so temporarily construct one. This will be replaced with a reference to a struct mempolicy in the tmpfs superblock in a subsequent patch. NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal sign '=' as the nodemask delimiter to match mpol_parse_str() and the tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This is a user visible change to numa_maps, but then the addition of the mode flags already changed the display. It makes sense to me to have the mounts and numa_maps display the policy in the same format. However, if anyone objects strongly, I can pass the desired nodemask delimeter as an arg to mpol_to_str(). Note 2: Like show_numa_map(), I don't check the return code from mpol_to_str(). I do use a longer buffer than the one provided by show_numa_map(), which seems to have sufficed so far. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rework mempolicy Reference Counting [yet again]Lee Schermerhorn
After further discussion with Christoph Lameter, it has become clear that my earlier attempts to clean up the mempolicy reference counting were a bit of overkill in some areas, resulting in superflous ref/unref in what are usually fast paths. In other areas, further inspection reveals that I botched the unref for interleave policies. A separate patch, suitable for upstream/stable trees, fixes up the known errors in the previous attempt to fix reference counting. This patch reworks the memory policy referencing counting and, one hopes, simplifies the code. Maybe I'll get it right this time. See the update to the numa_memory_policy.txt document for a discussion of memory policy reference counting that motivates this patch. Summary: Lookup of mempolicy, based on (vma, address) need only add a reference for shared policy, and we need only unref the policy when finished for shared policies. So, this patch backs out all of the unneeded extra reference counting added by my previous attempt. It then unrefs only shared policies when we're finished with them, using the mpol_cond_put() [conditional put] helper function introduced by this patch. Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma containing just the policy. read_swap_cache_async() can call alloc_page_vma() multiple times, so we can't let alloc_page_vma() unref the shared policy in this case. To avoid this, we make a copy of any non-null shared policy and remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading a page [or multiple pages] from swap, so the overhead should not be an issue here. I introduced a new static inline function "mpol_cond_copy()" to copy the shared policy to an on-stack policy and remove the flags that would require a conditional free. The current implementation of mpol_cond_copy() assumes that the struct mempolicy contains no pointers to dynamically allocated structures that must be duplicated or reference counted during copy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: rename mpol_free to mpol_putLee Schermerhorn
This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: fix parsing of tmpfs mpol mount optionLee Schermerhorn
Parsing of new mode flags in the tmpfs mpol mount option is slightly broken: Setting a valid flag works OK: #mount -o remount,mpol=bind=static:1-2 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2) ... However, we can't remove them or change them, once we've set a valid flag: #mount -o remount,mpol=bind:1-2 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2) ... It SAYS it removed it, but that's just a copy of the input string. If we now try to set it to a different flag, we get: #mount -o remount,mpol=bind=relative:1-2 /dev/shm mount: /dev/shm not mounted already, or bad option And on the console, we see: tmpfs: Bad value 'bind' for mount option 'mpol' ^ lost remainder of string Furthermore, bogus flags are accepted with out error. Granted, they are a no-op: #mount -o remount,mpol=interleave=foo:0-3 /dev/shm #mount ... tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3) Again, that's just a copy of the input string shown by the mount command. This patch fixes the behavior by pre-zeroing the flags so that only one of the mutually exclusive flags can be set at one time. It also reports an error when an unrecognized flag is specified. The check for both flags being set is removed because it can't happen with this implementation. If we ever want to support multiple non-exclusive flags, this area will need rework and we will need to check that any mutually exclusive flags aren't specified. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: add MPOL_F_RELATIVE_NODES flagDavid Rientjes
Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies nodemasks passed via set_mempolicy() or mbind() should be considered relative to the current task's mems_allowed. When the mempolicy is created, the passed nodemask is folded and mapped onto the current task's mems_allowed. For example, consider a task using set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is 5-7 (the second, third, and fourth node of mems_allowed). If the same task is attached to a cpuset, the mempolicy nodemask is rebound each time the mems are changed. Some possible rebinds and results are: mems result 1-3 1-3 1-7 2-4 1,5-6 1,5-6 1,5-7 5-7 Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned to the resultant nodemask from the relative remap. In the MPOL_PREFERRED case, the preferred node is remapped from the currently effected nodemask to the relative nodemask. This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: add MPOL_F_STATIC_NODES flagDavid Rientjes
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the node remap when the policy is rebound. Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of a union with cpuset_mems_allowed: struct mempolicy { ... union { nodemask_t cpuset_mems_allowed; nodemask_t user_nodemask; } w; } that stores the the nodemask that the user passed when he or she created the mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's passed nodemask intersected with the VMA or task's allowed nodes is always used when determining the preferred node, setting the MPOL_BIND zonelist, or creating the interleave nodemask. This happens whenever the policy is rebound, including when a task's cpuset assignment changes or the cpuset's mems are changed. This creates an interesting side-effect in that it allows the mempolicy "intent" to lie dormant and uneffected until it has access to the node(s) that it desires. For example, if you currently ask for an interleaved policy over a set of nodes that you do not have access to, the mempolicy is not created and the task continues to use the previous policy. With this change, however, it is possible to create the same mempolicy; it is only effected when access to nodes in the nodemask is acquired. It is also possible to mount tmpfs with the static nodemask behavior when specifying a node or nodemask. To do this, simply add "=static" immediately following the mempolicy mode at mount time: mount -o remount mpol=interleave=static:1-3 Also removes mpol_check_policy() and folds its logic into mpol_new() since it is now obsoleted. The unused vma_mpol_equal() is also removed. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: support optional mode flagsDavid Rientjes
With the evolution of mempolicies, it is necessary to support mempolicy mode flags that specify how the policy shall behave in certain circumstances. The most immediate need for mode flag support is to suppress remapping the nodemask of a policy at the time of rebind. Both the mempolicy mode and flags are passed by the user in the 'int policy' formal of either the set_mempolicy() or mbind() syscall. A new constant, MPOL_MODE_FLAGS, represents the union of legal optional flags that may be passed as part of this int. Mempolicies that include illegal flags as part of their policy are rejected as invalid. An additional member to struct mempolicy is added to support the mode flags: struct mempolicy { ... unsigned short policy; unsigned short flags; } The splitting of the 'int' actual passed by the user is done in sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of there are additional flags, and storing it in the new 'flags' member of struct mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in the 'policy' member of the struct and all current users of pol->policy remain unchanged. The union of the policy mode and optional mode flags is passed back to the user in get_mempolicy(). This combination of mode and flags within the same actual does not break userspace code that relies on get_mempolicy(&policy, ...) and either switch (policy) { case MPOL_BIND: ... case MPOL_INTERLEAVE: ... }; statements or if (policy == MPOL_INTERLEAVE) { ... } statements. Such applications would need to use optional mode flags when calling set_mempolicy() or mbind() for these previously implemented statements to stop working. If an application does start using optional mode flags, it will need to mask the optional flags off the policy in switch and conditional statements that only test mode. An additional member is also added to struct shmem_sb_info to store the optional mode flags. [hugh@veritas.com: shmem mpol: fix build warning] Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28mempolicy: convert MPOL constants to enumDavid Rientjes
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and MPOL_INTERLEAVE, are better declared as part of an enum since they are sequentially numbered and cannot be combined. The policy member of struct mempolicy is also converted from type short to type unsigned short. A negative policy does not have any legitimate meaning, so it is possible to change its type in preparation for adding optional mode flags later. The equivalent member of struct shmem_sb_info is also changed from int to unsigned short. For compatibility, the policy formal to get_mempolicy() remains as a pointer to an int: int get_mempolicy(int *policy, unsigned long *nmask, unsigned long maxnode, unsigned long addr, unsigned long flags); although the only possible values is the range of type unsigned short. Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-20mm/shmem and tiny-shmem: fix some kernel-docRandy Dunlap
Convert tiny-shmem.c function comments to kernel-doc. Add parameters and convert/fix other kernel-doc in shmem.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-05memcg: mem_cgroup_charge never NULLHugh Dickins
My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to mem_cgroup_charge_common. It seemed convenient at the time, but hard to justify now: there's a perfectly appropriate swappage to charge and uncharge instead, this is not on any hot path through shmem_getpage, and no performance hit was observed from the slight extra overhead. So revert that NULL page handling from mem_cgroup_charge_common; and make it clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that was a helper I found more of a hindrance to understanding. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mount-options-fix-tmpfs-fixAndrew Morton
Documentation/SubmitCheckist, please. Cc: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08mount options: fix tmpfsakpm@linux-foundation.org
Add .show_options super operation to tmpfs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07memcgroup: fix hang with shmem/tmpfsHugh Dickins
The memcgroup regime relies upon a cgroup reclaiming pages from itself within add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs rely upon using add_to_page_cache while holding a spinlock: when it cannot wait. The consequence is that when a cgroup reaches its limit, shmem_getpage just hangs - unless there is outside memory pressure too, neither kswapd nor radix_tree_preload get it out of the retry loop. In most cases we can mem_cgroup_cache_charge the page waitably first, to attach the page_cgroup in advance, so add_to_page_cache will do no more than increment a count; then mem_cgroup_uncharge_page after (in both success and failure cases) to balance the books again. And where there used to be a congestion_wait for kswapd (recently made redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page to go through a cycle of allocation and freeing, without accounting to any particular page, and without updating the statistics vector. This brings the cgroup below its limit so the next try usually succeeds. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05VFS/Security: Rework inode_getsecurity and callers to return resulting bufferDavid P. Quigley
This patch modifies the interface to inode_getsecurity to have the function return a buffer containing the security blob and its length via parameters instead of relying on the calling function to give it an appropriately sized buffer. Security blobs obtained with this function should be freed using the release_secctx LSM hook. This alleviates the problem of the caller having to guess a length and preallocate a buffer for this function allowing it to be used elsewhere for Labeled NFS. The patch also removed the unused err parameter. The conversion is similar to the one performed by Al Viro for the security_getprocattr hook. Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: fix shmem_swaplist racesHugh Dickins
Intensive swapoff testing shows shmem_unuse spinning on an entry in shmem_swaplist pointing to itself: how does that come about? Days pass... First guess is this: shmem_delete_inode tests list_empty without taking the global mutex (so the swapping case doesn't slow down the common case); but there's an instant in shmem_unuse_inode's list_move_tail when the list entry may appear empty (a rare case, because it's actually moving the head not the the list member). So there's a danger of leaving the inode on the swaplist when it's freed, then reinitialized to point to itself when reused. Fix that by skipping the list_move_tail when it's a no-op, which happens to plug this. But this same spinning then surfaces on another machine. Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we still hold page lock, which would hold off inode deletion if the page were in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the the inode on swaplist earlier, but then shmem_unuse_inode could never prune unswapped inodes. Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode; though I am a little uneasy about the iput which has to follow - it works, and I see nothing wrong with it, but it is surprising that shmem inode deletion may now occur below shmem_writepage. Revisit this fix later? And while we're looking at these races: the way shmem_unuse tests swapped without holding info->lock looks unsafe, if we've more than one swap area: a racing shmem_writepage on another page of the same inode could be putting it in swapcache, just as we're deciding to remove the inode from swaplist - there's a danger of going on swap without being listed, so a later swapoff would hang, being unable to locate the entry. Move that test and removal down into shmem_unuse_inode, once info->lock is held. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: radix_tree_preloadingHugh Dickins
Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache or swap cache, without any radix tree preload: so tending to deplete emergency reserves of memory. GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's being called under memory pressure, so must not wait for more memory to become available. But shmem_unuse_inode now has a window in which it can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its add_to_page_cache. shmem_getpage is not so straightforward: its filepage/swappage integrity relies upon exchanging between caches under spinlock, and it would need a lot of restructuring to place the preloads correctly. Instead, follow its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in add_to_page_cache, and begin each circuit of the repeat loop with a sleeping radix_tree_preload, followed immediately by radix_tree_preload_end - that won't guarantee success in the next add_to_page_cache, but doesn't need to. And we can then remove that bothersome congestion_wait: when needed, it'll automatically get done in the course of the radix_tree_preload. Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: open a window in shmem_unuse_inodeHugh Dickins
There are a couple of reasons (patches follow) why it would be good to open a window for sleep in shmem_unuse_inode, between its search for a matching swap entry, and its handling of the entry found. shmem_unuse_inode must then use igrab to hold the inode against deletion in that window, and its corresponding iput might result in deletion: so it had better unlock_page before the iput, and might as well release the page too. Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse into shmem_unuse_inode, in the case when it finds a match. Let try_to_unuse break on error in the shmem_unuse case, as it does in the unuse_mm case: though at this point in the series, no error to break on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: make shmem_unuse more preemptibleHugh Dickins
shmem_unuse is at present an unbroken search through every swap vector page of every tmpfs file which might be swapped, all under shmem_swaplist_lock. This dates from long ago, when the caller held mmlist_lock over it all too: long gone, but there's never been much pressure for preemptible swapoff. Make it a little more preemptible, replacing shmem_swaplist_lock by shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a cond_resched_lock (on info->lock) at one convenient point in the shmem_unuse_inode loop, where it has no outstanding kmap_atomic. If we're serious about preemptible swapoff, there's much further to go e.g. I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM case dictate preemptiblility for other configs. But as in the earlier patch to make swapoff scan ptes preemptibly, my hidden agenda is really towards making memcgroups work, hardly about preemptibility at all. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: allocate on read when stackedHugh Dickins
tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or size=0). But if a stacked filesystem such as unionfs gets pages from a sparse tmpfs file by reading holes, and then writes to them, it can easily exceed any such limit at present. So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed, pessimistically mark such pages as dirty, so they cannot get reclaimed and unaccounted by mistake. The venerable shmem_recalc_inode code (originally to account for the reclaim of clean pages) suffices to get the accounting right when swappages are dropped in favour of more uptodate filepages. This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage, caused by unionfs writing to a very sparse tmpfs file: to minimize memory allocation in swapout, tmpfs requires the swap vector be allocated upfront, which wasn't always happening in this stacked case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: allow filepage alongside swappageHugh Dickins
tmpfs has long allowed for a fresh filepage to be created in pagecache, just before shmem_getpage gets the chance to match it up with the swappage which already belongs to that offset. But unionfs_writepage now does a find_or_create_page, divorced from shmem_getpage, which leaves conflicting filepage and swappage outstanding indefinitely, when unionfs is over tmpfs. Therefore shmem_writepage (where a page is swizzled from file to swap) must now be on the lookout for existing swap, ready to free it in favour of the more uptodate filepage, instead of BUGging on that clash. And when the add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails in shmem_getpage, it should retry in the same way it already does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: move swap swizzling into shmemHugh Dickins
move_to_swap_cache and move_from_swap_cache functions (which swizzle a page between tmpfs page cache and swap cache, to avoid page copying) are only used by shmem.c; and our subsequent fix for unionfs needs different treatments in the two instances of move_from_swap_cache. Move them from swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making add_to_swap_cache externally visible. shmem.c likes to say set_page_dirty where swap_state.c liked to say SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback makes moot (and implies we should lose that "shift page from clean_pages to dirty_pages list" comment: it's on neither). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: fix mounts when size is less than the page sizeMichael Marineau
When tmpfs is mounted with a size less than one page, the number of blocks is set to 0 which makes the tmpfs mount unlimited. This can lead to a quick and surprising death if someone typos a tmpfs mount command and writes too much. tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0, as Documentation/filesystems/tmpfs.txt says. Hugh: do this by rounding size up instead of down in all cases: which slightly expands other odd-sized tmpfs mounts, but in a consistent way. Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem: factor out sbi->free_inodes manipulationsPavel Emelyanov
The shmem_sb_info structure has a number of free_inodes. This value is altered in appropriate places under spinlock and with the sbi->max_inodes != 0 check. Consolidate these manipulations into two helpers. This is minus 42 bytes of shmem.o and minus 4 :) lines of code. [akpm@linux-foundation.org: fix error return values] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem_file_write is redundantHugh Dickins
With the old aops, writing to a tmpfs file had to use its own special method: the generic method would pass in a fresh page to prepare_write when the right page was there in swapcache - which was inefficient to handle, even once we'd concocted the code to handle it. With the new aops, the generic method uses shmem_write_end, which lets shmem_getpage find the right page: so now abandon shmem_file_write in favour of the generic method. Yes, that does do several things that tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited, which ramfs also calls); but more use of common code is preferable. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem_getpage return page lockedHugh Dickins
In the new aops, write_begin is supposed to return the page locked: though I've seen no ill effects, that's been overlooked in the case of shmem_write_begin, and should be fixed. Then shmem_write_end must unlock the page: do so _after_ updating i_size, as we found to be important in other filesystems (though since shmem pages don't go the usual writeback route, they never suffered from that corruption). For shmem_write_begin to return the page locked, we need shmem_getpage to return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's simplify the interface and return it locked even when SGP_READ. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem: SGP_QUICK and SGP_FAULT redundantHugh Dickins
Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no users now. Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and shmem_getpage is about to return with page always locked). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05swapin needs gfp_mask for loop on tmpfsHugh Dickins
Building in a filesystem on a loop device on a tmpfs file can hang when swapping, the loop thread caught in that infamous throttle_vm_writeout. In theory this is a long standing problem, which I've either never seen in practice, or long ago suppressed the recollection, after discounting my load and my tmpfs size as unrealistically high. But now, with the new aops, it has become easy to hang on one machine. Loop used to grab_cache_page before the old prepare_write to tmpfs, which seems to have been enough to free up some memory for any swapin needed; but the new write_begin lets tmpfs find or allocate the page (much nicer, since grab_cache_page missed tmpfs pages in swapcache). When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the mapping_gfp_mask - hence the hang. So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to swapin_readahead to read_swap_cache_async to add_to_swap_cache. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05swapin_readahead: move and rearrange argsHugh Dickins
swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c beside its kindred read_swap_cache_async. Why were its args in a different order? rearrange them. And since it was always followed by a read_swap_cache_async of the target page, fold that in and return struct page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and read_swap_cache_async stubs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05swapin_readahead: excise NUMA bogosityHugh Dickins
For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA code, advancing addr, and stepping on to the next vma at the boundary, to line up the mempolicy for each page allocation. It _might_ be a good idea to allocate swap more according to vma layout; but the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is allocated as needed for pages as they sink to the bottom of the inactive LRUs. Sometimes that may match vma layout, but not so often that it's worth going to these misleading vma->vm_next lengths: rip all that out. Originally I intended to retain the incrementation of addr, but correct its initial value: valid_swaphandles generally supplies an offset below the target addr (this is readaround rather than readahead), but addr has not been adjusted accordingly, so in the interleave case it has usually been allocating the target page from the "wrong" node (though that may not matter very much). But look at the equivalent shmem_swapin code: either by oversight or by design, though it has all the apparatus for choosing a new mempolicy per page, it uses the same idx throughout, choosing the same mempolicy and interleave node for each page of the cluster. Which is actually a much better strategy: each node has its own LRUs and its own kswapd, so if you're betting on any particular relationship between swap and node, the best bet is that nearby swap entries belong to pages from the same node - even when the mempolicy of the target page is to interleave. And examining a map of nodes corresponding to swap entries on a numa=fake system bears this out. (We could later tweak swap allocation to make it even more likely, but this patch is merely about removing cruft.) So, neither adjust nor increment addr in swapin_readahead, and then shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up once per cluster, and so few fields of pvma are used, let's skip the memset - from shmem_alloc_page also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-28tmpfs: restore missing clear_highpageHugh Dickins
tmpfs was misconverted to __GFP_ZERO in 2.6.11. There's an unusual case in which shmem_getpage receives the page from its caller instead of allocating. We must cover this case by clear_highpage before SetPageUptodate, as before. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-30fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATEHugh Dickins
It's possible to provoke unionfs (not yet in mainline, though in mm and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect it's possible to provoke the 2.6.23 ecryptfs in the same way (but the 2.6.24 ecryptfs no longer calls lower level's ->writepage). This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could leak from tmpfs via write_cache_pages and unionfs to userspace. There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 - writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so far as it goes; but insufficient because it doesn't address the underlying issue, that shmem_writepage expects to be called only by vmscan (relying on backing_dev_info capabilities to prevent the normal writeback path from ever approaching it). That's an increasingly fragile assumption, and ramdisk_writepage (the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check wbc->for_reclaim before returning it. Make the same check in shmem_writepage, thereby sidestepping the page_mapped BUG also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Erez Zadok <ezk@cs.sunysb.edu> Cc: <stable@kernel.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22exportfs: make struct export_operations constChristoph Hellwig
Now that nfsd has stopped writing to the find_exported_dentry member we an mark the export_operations const Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: <linux-ext4@vger.kernel.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Chinner <dgc@sgi.com> Cc: Timothy Shimmin <tes@sgi.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Chris Mason <mason@suse.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22shmem: new export opsChristoph Hellwig
I'm not sure what people were thinking when adding support to export tmpfs, but here's the conversion anyway: Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17r/o bind mounts: filesystem helpers for custom 'struct file'sDave Hansen
Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. This patch: Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17SLAB_PANIC more (proc, posix-timers, shmem)Alexey Dobriyan
These aren't modular, so SLAB_PANIC is OK. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Slab API: remove useless ctor parameter and reorder parametersChristoph Lameter
Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17mm: bdi init hooksPeter Zijlstra
provide BDI constructor/destructor hooks [akpm@linux-foundation.org: compile fix] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16mm/shmem.c: make 3 functions staticAdrian Bunk
This patch makes three needlessly global functions static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Group short-lived and reclaimable kernel allocationsMel Gorman
This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16memoryless nodes: fixup uses of node_online_map in generic codeLee Schermerhorn
Here's a cut at fixing up uses of the online node map in generic code. mm/shmem.c:shmem_parse_mpol() Ensure nodelist is subset of nodes with memory. Use node_states[N_HIGH_MEMORY] as default for missing nodelist for interleave policy. mm/shmem.c:shmem_fill_super() initialize policy_nodes to node_states[N_HIGH_MEMORY] mm/page-writeback.c:highmem_dirtyable_memory() sum over nodes with memory mm/page_alloc.c:zlc_setup() allowednodes - use nodes with memory. mm/page_alloc.c:default_zonelist_order() average over nodes with memory. mm/page_alloc.c:find_next_best_node() skip nodes w/o memory. N_HIGH_MEMORY state mask may not be initialized at this time, unless we want to depend on early_calculate_totalpages() [see below]. Will ZONE_MOVABLE ever be configurable? mm/page_alloc.c:find_zone_movable_pfns_for_nodes() spread kernelcore over nodes with memory. This required calling early_calculate_totalpages() unconditionally, and populating N_HIGH_MEMORY node state therein from nodes in the early_node_map[]. If we can depend on this, we can eliminate the population of N_HIGH_MEMORY mask from __build_all_zonelists() and use the N_HIGH_MEMORY mask in find_next_best_node(). mm/mempolicy.c:mpol_check_policy() Ensure nodes specified for policy are subset of nodes with memory. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16implement simple fs aopsNick Piggin
Implement new aops for some of the simpler filesystems. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Clean up duplicate includes in mm/Jesper Juhl
This patch cleans up duplicate includes in mm/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20mm: Remove slab destructors from kmem_cache_create().Paul Mundt
Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-19mm: fault feedback #2Nick Piggin
This patch completes Linus's wish that the fault return codes be made into bit flags, which I agree makes everything nicer. This requires requires all handle_mm_fault callers to be modified (possibly the modifications should go further and do things like fault accounting in handle_mm_fault -- however that would be for another patch). [akpm@linux-foundation.org: fix alpha build] [akpm@linux-foundation.org: fix s390 build] [akpm@linux-foundation.org: fix sparc build] [akpm@linux-foundation.org: fix sparc64 build] [akpm@linux-foundation.org: fix ia64 build] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Matthew Wilcox <willy@debian.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Still apparently needs some ARM and PPC loving - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: fault feedback #1Nick Piggin
Change ->fault prototype. We now return an int, which contains VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte. FAULT_RET_ code tells the VM whether a page was found, whether it has been locked, and potentially other things. This is not quite the way he wanted it yet, but that's changed in the next patch (which requires changes to arch code). This means we no longer set VM_CAN_INVALIDATE in the vma in order to say that a page is locked which requires filemap_nopage to go away (because we can no longer remain backward compatible without that flag), but we were going to do that anyway. struct fault_data is renamed to struct vm_fault as Linus asked. address is now a void __user * that we should firmly encourage drivers not to use without really good reason. The page is now returned via a page pointer in the vm_fault struct. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: merge populate and nopage into fault (fixes nonlinear)Nick Piggin
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. ->populate is a layering violation because the filesystem/pagecache code should need to know anything about the virtual memory mapping. The hitch here is that the ->nopage handler didn't pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway (because that's a similar layering violation). Having the populate handler install the pte itself is likewise a nasty thing to be doing. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. [akpm@linux-foundation.org: cleanup] [randy.dunlap@oracle.com: doc. fixes for readahead] [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19mm: fix fault vs invalidate race for linear mappingsNick Piggin
Fix the race between invalidate_inode_pages and do_no_page. Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17knfsd: exportfs: add exportfs.h headerChristoph Hellwig
currently the export_operation structure and helpers related to it are in fs.h. fs.h is already far too large and there are very few places needing the export bits, so split them off into a separate header. [akpm@linux-foundation.org: fix cifs build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17Add __GFP_MOVABLE for callers to flag allocations from high memory that may ↵Mel Gorman
be migrated It is often known at allocation time whether a page may be migrated or not. This patch adds a flag called __GFP_MOVABLE and a new mask called GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. An API function very similar to alloc_zeroed_user_highpage() is added for __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The flags used by alloc_zeroed_user_highpage() are not changed because it would change the semantics of an existing API. After this patch is applied there are no in-kernel users of alloc_zeroed_user_highpage() so it probably should be marked deprecated if this patch is merged. Note that this patch includes a minor cleanup to the use of __GFP_ZERO in shmem.c to keep all flag modifications to inode->mapping in the shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of Hugh Dickens. Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the concept. Credit to Hugh Dickens for catching issues with shmem swap vector and ramfs allocations. [akpm@linux-foundation.org: build fix] [hugh@veritas.com: __GFP_ZERO cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-10shmem: convert to using splice instead of sendfile()Hugh Dickins
Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice, loop and sendfile in the simplest way, using generic_file_splice_read and generic_file_splice_write (with the aid of shmem_prepare_write). We could make some efficiency tweaks later, if there's a real need; but this is stable and works well as is. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>