summaryrefslogtreecommitdiff
path: root/fs/exofs
AgeCommit message (Collapse)Author
2016-01-22wrappers for ->i_mutex accessAl Viro
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-15kmemcg: account certain kmem allocations to memcgVladimir Davydov
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-11Merge branch 'work.symlinks' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs RCU symlink updates from Al Viro: "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode even if the symlink is not an embedded one. No changes since the mailbomb on Jan 1" * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->get_link() to delayed_call, kill ->put_link() kill free_page_put_link() teach nfs_get_link() to work in RCU mode teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode teach shmem_get_link() to work in RCU mode teach page_get_link() to work in RCU mode replace ->follow_link() with new method that could stay in RCU mode don't put symlink bodies in pagecache into highmem namei: page_getlink() and page_follow_link_light() are the same thing ufs: get rid of ->setattr() for symlinks udf: don't duplicate page_symlink_inode_operations logfs: don't duplicate page_symlink_inode_operations switch befs long symlinks to page_symlink_operations
2015-12-12osd fs: __r4w_get_page rely on PageUptodate for uptodateHugh Dickins
Commit 42cb14b110a5 ("mm: migrate dirty page without clear_page_dirty_for_io etc") simplified the migration of a PageDirty pagecache page: one stat needs moving from zone to zone and that's about all. It's convenient and safest for it to shift the PageDirty bit from old page to new, just before updating the zone stats: before copying data and marking the new PageUptodate. This is all done while both pages are isolated and locked, just as before; and just as before, there's a moment when the new page is visible in the radix_tree, but not yet PageUptodate. What's new is that it may now be briefly visible as PageDirty before it is PageUptodate. When I scoured the tree to see if this could cause a problem anywhere, the only places I found were in two similar functions __r4w_get_page(): which look up a page with find_get_page() (not using page lock), then claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate. I'm not sure whether that was right before, but now it might be wrong (on rare occasions): only claim the page is uptodate if PageUptodate. Or perhaps the page in question could never be migratable anyway? Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Boaz Harrosh <ooo@electrozaur.com> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-09don't put symlink bodies in pagecache into highmemAl Viro
kmap() in page_follow_link_light() needed to go - allowing to hold an arbitrary number of kmaps for long is a great way to deadlocking the system. new helper (inode_nohighmem(inode)) needs to be used for pagecache symlinks inodes; done for all in-tree cases. page_follow_link_light() instrumented to yell about anything missed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-09fs/exofs/namei.c: remove unnecessary new_valid_dev() checkYaowei Bai
new_valid_dev() always returns 1, so the !new_valid_dev() check is not needed. Remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Boaz Harrosh <ooo@electrozaur.com> Cc: Benny Halevy <bhalevy@primarydata.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-23pagemap.h: move dir_pages() over thereFabian Frederick
That function was declared in a lot of filesystems to calculate directory pages. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-11exofs: switch to {simple,page}_symlink_inode_operationsAl Viro
ACK-by: Boaz Harrosh <ooo@electrozaur.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-12direct_IO: remove rw from a_ops->direct_IO()Omar Sandoval
Now that no one is using rw, remove it completely. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-12make new_sync_{read,write}() staticAl Viro
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17vfs: remove get_xip_memMatthew Wilcox
All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Also remove mm/filemap_xip.c as it is now empty. Also remove documentation of the long-gone get_xip_page(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-20fs: remove mapping->backing_dev_infoChristoph Hellwig
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20fs: introduce f_op->mmap_capabilities for nommu mmap supportChristoph Hellwig
Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-19Boaz Harrosh - Fix broken email addressBoaz Harrosh
I no longer have access to the Panasas email. So change to an email that can always reach me. Signed-off-by: Boaz Harrosh <ooo@electrozaur.com>
2014-08-08fs/exofs/ore_raid.c: replace count*size kzalloc by kcallocFabian Frederick
kcalloc manages count*sizeof overflow. Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-12->splice_write() via ->write_iter()Al Viro
iter_file_splice_write() - a ->splice_write() instance that gathers the pipe buffers, builds a bio_vec-based iov_iter covering those and feeds it to ->write_iter(). A bunch of simple cases coverted to that... [AV: fixed the braino spotted by Cyrill] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-22ore: Support for raid 6Boaz Harrosh
This simple patch adds support for raid6 to the ORE. Most operations and calculations where already for the general case. Only things left: * call async_gen_syndrome() in the case of raid6 (NOTE that the raid6 math is the one supported by the Linux Kernel see: crypto/async_tx/async_pq.c) * call _ore_add_parity_unit() twice with only last call generating the redundancy pages. * Fix couple BUGS in old code a. In reads when parity==2 it can happen that per_dev->length=0 but per_dev->offset was set and adjusted by _ore_add_sg_seg(). Don't let it be overwritten. b. The all 'cur_comp > starting_dev' thing to determine if: "per_dev->offset is in the current stripe number or the next one." Was a complete raid5/4 accident. When parity==2 this is not at all true usually. All we need to do is increment si->ob_offset once we pass by the first parity device. (This also greatly simplifies the code, amen) c. Calculation of si->dev rotation can overflow when parity==2. * Then last enable raid6 in ore_verify_layout() I want to deeply thank Daniel Gryniewicz who found first all the bugs in the old raid code, and inspired these patches: Inspired-by Daniel Gryniewicz <dang@linuxbox.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-05-22ore: Remove redundant dev_order(), more cleanupsBoaz Harrosh
Two cleanups: * si->cur_comp, si->cur_pg where always calculated after the call to ore_calc_stripe_info() with the help of _dev_order(...). But these are already calculated by ore_calc_stripe_info() and can be just set there. (This is left over from the time that si->cur_comp, si->cur_pg were only used by raid code, but now the main loop manages them anyway even though they are ultimately not used in none raid code) * si->cur_comp - For the very last stripe case, was set inside _ore_add_parity_unit(). This is not clear and will be wrong for coming raid6 so move this to only caller. Now si->cur_comp is only manipulated within _prepare_for_striping(), always next to the manipulation of cur_dev. Which is much easier to understand and follow. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-05-22ore: (trivial) reformat some codeBoaz Harrosh
rearrange some source lines. Nothing changed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-05-06write_iter variants of {__,}generic_file_aio_write()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06switch simple generic_file_aio_read() users to ->read_iter()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06pass iov_iter to ->direct_IO()Al Viro
unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-10Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds
Pull exofs updates from Boaz Harrosh: "Trivial updates to exofs for 3.15-rc1 Just a few fixes sent by people" * 'for-linus' of git://git.open-osd.org/linux-open-osd: MAINTAINERS: Update email address for bhalevy fs: Mark functions as static in exofs/ore_raid.c fs: Mark function as static in exofs/super.c
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03fs: Mark functions as static in exofs/ore_raid.cRashika Kheria
Mark functions as static in exofs/ore_raid.c because they are not used outside this file. This also eliminates the following warning in exofs/ore_raid.c: fs/exofs/ore_raid.c:24:14: warning: no previous prototype for _raid_page_alloc [-Wmissing-prototypes] fs/exofs/ore_raid.c:29:6: warning: no previous prototype for _raid_page_free [-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-04-03fs: Mark function as static in exofs/super.cRashika Kheria
Mark function as static in exofs/super.c because it is not used outside this file. This also eliminates the following warning in exofs/super.c: fs/exofs/super.c:546:5: warning: no previous prototype \ for __alloc_dev_table[-Wmissing-prototypes] Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-01-23exofs: Print less in r4wBoaz Harrosh
In debug mode exofs is too verbose. Hiding the real problems remove some trivial stuff. Also fix some other prints. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-01-23exofs: Allow corrupted directory entry to be empty fileBoaz Harrosh
If there was an error in fetching an object or extracting inode info from attributes. Which means corrupted storage. Let it be an empty ZERO dated directory entry so it can be deleted. Otherwise the all directory will be inaccessible. This does not loose data, because if there is an orphan object somewhere it will be recovered by fschk. But usually this only means corrupted dir entry. The object was never generated and only its link exist. This way we can delete the bad entry. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-01-23exofs: Allow O_DIRECT openBoaz Harrosh
With this minimal do nothing patch an application can open O_DIRECT and then actually do buffered sync IO instead. But the aio API is supported which is a good thing Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-01-23ore: Don't crash on NULL bio in _clear_bioBoaz Harrosh
In the case of target returning OSD_ERR_PRI_CLEAR_PAGES when we only sent for attributes don't crash on NULL bio. This is an osd-target bug but don't crash regardless Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2014-01-23ore: Fix wrong math in allocation of per device BIOBoaz Harrosh
At IO preparation we calculate the max pages at each device and allocate a BIO per device of that size. The calculation was wrong on some unaligned corner cases offset/length combination and would make prepare return with -ENOMEM. This would be bad for pnfs-objects that would in that case IO through MDS. And fatal for exofs were it would fail writes with EIO. Fix it by doing the proper math, that will work in all cases. (I ran a test with all possible offset/length combinations this time round). Also when reading we do not need to allocate for the parity units since we jump over them. Also lower the max_io_length to take into account the parity pages so not to allocate BIOs bigger than PAGE_SIZE CC: Stable Kernel <stable@vger.kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2013-09-12truncate: drop 'oldsize' truncate_pagecache() parameterKirill A. Shutemov
truncate_pagecache() doesn't care about old size since commit cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-06-29[readdir] convert exofsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-22mm: change invalidatepage prototype to accept lengthLukas Czerner
Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
2013-03-23block: Add bio_for_each_segment_all()Kent Overstreet
__bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>
2013-03-04fs: Limit sys_mount to only request filesystem modules.Eric W. Biederman
Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-23new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-14exofs: don't leak io_state and pages on read errorBoaz Harrosh
Same bug as fixed by Idan for write_exec was in read_exec. Fix the io_state leak and pages state on read error. Also while at it: The if (!pcol->read_4_write) at the error path is redundant because all goto err; are after the if (pcol->read_4_write) bale out. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-12-11exofs: clean up the correct page collection on write errorIdan Kedar
if ore_write() fails, we would unlock the pages of pcol, which is now empty, rather than pcol_copy which owns the pages when ore_write() is called. this means that no pages will actually be unlocked (pcol.nr_pages == 0) and the writing process (more accurately, the syncing process) will hang waiting for a writeback notification that never comes. moreover, if ore_write() fails, pcol_free() is called for pcol, whereas pcol_copy is the object owning the ore_io_state, thus leaking the ore_io_state. [Boaz] I have simplified Idan's original patch a bit, everything else still holds Signed-off-by: Idan Kedar <idank@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-10-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull pile 2 of vfs updates from Al Viro: "Stuff in this one - assorted fixes, lglock tidy-up, death to lock_super(). There'll be a VFS pile tomorrow (with patches from Jeff Layton, sanitizing getname() and related parts of audit and preparing for ESTALE fixes), but I'd rather push the stuff in this one ASAP - some of the bugs closed here are quite unpleasant." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: bogus warnings in fs/namei.c consitify do_mount() arguments lglock: add DEFINE_STATIC_LGLOCK() lglock: make the per_cpu locks static lglock: remove unused DEFINE_LGLOCK_LOCKDEP() MAX_LFS_FILESIZE definition for 64bit needs LL... tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking vfs: drop lock/unlock super ufs: drop lock/unlock super sysv: drop lock/unlock super hpfs: drop lock/unlock super fat: drop lock/unlock super ext3: drop lock/unlock super exofs: drop lock/unlock super dup3: Return an error when oldfd == newfd. fs: handle failed audit_log_start properly fs: prevent use after free in auditing when symlink following was denied
2012-10-11Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit 6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...
2012-10-10exofs: drop lock/unlock superMarco Stornelli
Removed lock/unlock super. Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09Merge branch 'linux-next' of git://git.open-osd.org/linux-open-osdLinus Torvalds
Pull exofs update from Boaz Harrosh: "Just three one liners" * 'linux-next' of git://git.open-osd.org/linux-open-osd: pnfs_osd_xdr: Remove unused #include from pnfs_osd_xdr.h ore: signedness bug in _sp2d_min_pg() exofs: check for allocation failure in uri_store()
2012-10-03ore: signedness bug in _sp2d_min_pg()Dan Carpenter
This for loop doesn't work correctly when "p" is unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-10-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-10-03fs: push rcu_barrier() from deactivate_locked_super() to filesystemsKirill A. Shutemov
There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values are far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privlige checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user names to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/git logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...