path: root/fs
Age | Commit message | Author
2008-10-09 | block: mark bio_split_pool static | Denis ChengRq
Since all bio_split calls refer to the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of taking a mempool_t parameter. The mempool_t parameter can then be removed from the bio_split parameter list, and because bio_split_pool is only referenced in fs/bio.c, it can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: Find bio sector offset given idx and offset | Martin K. Petersen
Helper function to find the sector offset in a bio given bvec index and page offset. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
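As a rough illustration of what such a helper computes, the following user-space sketch walks an array of segments and counts the whole sectors that precede a given byte offset in a given segment. `struct seg` and `sector_offset()` are hypothetical stand-ins for `struct bio_vec` and the kernel helper, and the sector size is passed in explicitly instead of being read from the queue; it is a model of the idea, not the code added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct bio_vec. */
struct seg {
    unsigned int len;     /* bytes of data in this segment */
    unsigned int offset;  /* byte offset of the data within its page */
};

/*
 * Count the whole sectors that precede byte `offset` (a page offset)
 * inside segment `index`. `sector_sz` is the sector size in bytes.
 */
unsigned long sector_offset(const struct seg *segs, size_t nsegs,
                            size_t index, unsigned int offset,
                            unsigned int sector_sz)
{
    unsigned long sectors = 0;
    size_t i;

    assert(segs != NULL && sector_sz > 0 && index < nsegs);

    /* Every segment before `index` contributes all of its sectors. */
    for (i = 0; i < index; i++)
        sectors += segs[i].len / sector_sz;

    /* Within the target segment, count only the part before `offset`. */
    if (offset > segs[index].offset)
        sectors += (offset - segs[index].offset) / sector_sz;

    return sectors;
}
```

With two 4096-byte segments followed by a 2048-byte segment whose data starts at page offset 512, asking for segment 2 at page offset 1536 skips 16 sectors for the first two segments and 2 more inside the third.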
2008-10-09 | block: Introduce integrity data ownership flag | Martin K. Petersen
A filesystem might supply its own integrity metadata. Introduce a flag that indicates whether the filesystem or the block layer owns the integrity buffer. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 | Jens Axboe
We need bdev_get_integrity() to support the pending md/dm patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: cleanup some of the integrity stuff in blkdev.h | Jens Axboe
Don't put functions that are only used in fs/bio-integrity.c in blkdev.h; it's much cleaner to keep them in that file. Also kill the completely unused bdev_get_tag_size(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: add bio_kmalloc() | Jens Axboe
Not all callers need (or want!) the mempool backing guarantee. It essentially means that you can only use bio_alloc() for short allocations and not for preallocating some bios at setup or init time. So add bio_kmalloc(), which does the same thing as bio_alloc() except that it uses kmalloc() as the backing instead of the bio mempools. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | Call flush_disk() after detecting an online resize. | Andrew Patterson
We call flush_disk() to make sure the buffer cache for the disk is flushed after a disk resize. There are two resize cases, growing and shrinking. Given that users can shrink and then grow a disk before revalidate_disk() is called, we treat the grow case identically to shrinking. We need to flush the buffer cache after an online shrink because, as James Bottomley puts it:
The two use cases for shrinking I can see are
1. planned: the fs is already shrunk to within the new boundaries and all data is relocated, so invalidate is fine (any dirty buffers that might exist in the shrunk region are there only because they were relocated but not yet written to their original location).
2. unplanned: In this case, the fs is probably toast, so whether we invalidate or not isn't going to make a whole lot of difference; it's still going to try to read or write from sectors beyond the new size and get I/O errors.
Immediately invalidating shrunk disks will cause errors for outstanding I/Os for reads/writes beyond the new end of the disk to be generated earlier than if we waited for the normal buffer cache operation. It also removes a potential security hole where we might keep old data around from beyond the end of the shrunk disk if the disk was not invalidated.
Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
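The resize handling described in this series (compare the gendisk capacity with the bdev inode size, adjust the latter, then flush) can be modeled in user space roughly as follows. `struct resize_disk`, `struct resize_bdev` and `check_size_change()` are hypothetical names for this sketch, and a counter stands in for the call to flush_disk(); the kernel code operates on the real gendisk and block_device structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors */

/* Hypothetical stand-ins for the gendisk capacity and the bdev inode. */
struct resize_disk { uint64_t capacity_sectors; };
struct resize_bdev { uint64_t size_bytes; unsigned int flushes; };

/*
 * Returns true if the disk size no longer matches the cached bdev size.
 * On a mismatch the cached size is updated and a flush is recorded,
 * for growing and shrinking alike.
 */
bool check_size_change(const struct resize_disk *disk,
                       struct resize_bdev *bdev)
{
    uint64_t disk_bytes;

    assert(disk != NULL && bdev != NULL);
    disk_bytes = disk->capacity_sectors << SECTOR_SHIFT;

    if (disk_bytes == bdev->size_bytes)
        return false;

    bdev->size_bytes = disk_bytes;
    bdev->flushes++;            /* stands in for flush_disk() */
    return true;
}
```

Treating both directions the same keeps the check to a single inequality, which matches the reasoning above that a shrink followed by a grow cannot be told apart from a plain grow by the time revalidation runs.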
2008-10-09 | Added flush_disk to factor out common buffer cache flushing code. | Andrew Patterson
We need to be able to flush the buffer cache for more than just when a disk is changed, so we factor out common cache flush code in check_disk_change() to an internal flush_disk() routine. This routine will then be used for both disk changes and disk resizes (in a later patch). Include the disk name in the text indicating that there are busy inodes on the device and increase the KERN severity of the message. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | Check for device resize when rescanning partitions | Andrew Patterson
Check for device resize in the rescan_partitions() routine. If the device has been resized, the bdev size is set to match. The rescan_partitions() routine is called when opening the device and when calling the BLKRRPART ioctl. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | Adjust block device size after an online resize of a disk. | Andrew Patterson
The revalidate_disk routine now checks if a disk has been resized by comparing the gendisk capacity to the bdev inode size. If they are different (usually because the disk has been resized underneath the kernel) the bdev inode size is adjusted to match the capacity. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | Wrapper for lower-level revalidate_disk routines. | Andrew Patterson
This is a wrapper for the lower-level revalidate_disk call-backs such as sd_revalidate_disk(). It allows us to perform pre and post operations when calling them. We will use this wrapper in a later patch to adjust block device sizes after an online resize (a _post_ operation). Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: make blk_rq_map_user take a NULL user-space buffer | FUJITA Tomonori
This patch changes blk_rq_map_user to accept a NULL user-space buffer with a READ command if rq_map_data is not NULL. Thus a caller can pass page frames to blk_rq_map_user to just set up a request and bios with page frames properly. bio_uncopy_user (called via blk_rq_unmap_user) doesn't copy data to user space for such a request. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | bio: convert bio_copy_kern to use bio_copy_user | FUJITA Tomonori
bio_copy_kern and bio_copy_user are very similar. This converts bio_copy_kern to use bio_copy_user. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: introduce struct rq_map_data to use reserved pages | FUJITA Tomonori
This patch introduces struct rq_map_data to enable bio_copy_user_iov() to use reserved pages. Currently, bio_copy_user_iov allocates bounce pages, but drivers/scsi/sg.c wants to allocate pages by itself and use them. struct rq_map_data can be used to pass allocated pages to bio_copy_user_iov. The current users of bio_copy_user_iov simply pass NULL (they don't want to use pre-allocated pages). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: add gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov | FUJITA Tomonori
Currently, blk_rq_map_user and blk_rq_map_user_iov always do GFP_KERNEL allocation. This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov so sg can use it (sg always does GFP_ATOMIC allocation). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: add support for IO CPU affinity | Jens Axboe
This patch adds support for controlling the IO completion CPU of either all requests on a queue, or on a per-request basis. We export a sysfs variable (rq_affinity) which, if set, migrates completions of requests to the CPU that originally submitted them. A bio helper (bio_set_completion_cpu()) is also added, so that queuers can ask for completion on that specific CPU. In testing, this has been shown to cut the system time by as much as 20-40% on synthetic workloads where CPU affinity is desired. This requires a little help from the architecture, so it'll only work as designed for archs that are using the new generic smp helper infrastructure. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: allow disk to have extended device number | Tejun Heo
Now that disk and partition handlings are mostly unified, it's easy to allow disk to have extended device number. This patch makes add_disk() use extended device number if disk->minors is zero. Both sd and ide-disk are updated to use this.
* sd_format_disk_name() is implemented which can generically determine the drive name. This removes disk number restriction stemming from limited device names.
* If sd index goes over SD_MAX_DISKS (which can be increased now BTW), sd simply doesn't initialize minors letting block layer choose extended device number.
* If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set minors to 0 and use extended device numbers.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: replace @ext_minors with GENHD_FL_EXT_DEVT | Tejun Heo
With previous changes, it's meaningless to limit the number of partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that setting the flag allows the disk to have maximum number of allowed partitions (only limited by the number of entries in parsed_partitions as determined by MAX_PART constant). This kills not-too-pretty alloc_disk_ext[_node]() functions and makes @minors parameter to alloc_disk[_node]() unnecessary. The parameter is left alone to avoid disturbing the users. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: make partition array dynamic | Tejun Heo
disk->__part used to be statically allocated to the maximum possible number of partitions. This patch makes partition array allocation dynamic. The added overhead is minimal as only real change is one memory dereference changed to RCU one. This saves both a bit of memory and cpu cycles iterating through unoccupied slots and makes increasing partition limit easier. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: move stats from disk to part0 | Tejun Heo
Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition is not part0, i.e. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similarly. It handles part0 stats automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_flight() is implemented which automatically updates part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
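The roll-up rule in the first bullet (an update to any partition other than part0 is also applied to part0) can be sketched in user space as below. `struct stat_disk`, `struct stat_part` and `part_stat_add_sectors()` are hypothetical names, the array is fixed-size, and plain counters replace the kernel's per-cpu statistics, so this only illustrates the accounting rule.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS 4   /* hypothetical small limit for the sketch */

struct stat_part { unsigned long sectors_read; };

/* parts[0] models part0 (the whole disk); parts[1..] are partitions. */
struct stat_disk { struct stat_part parts[MAX_PARTS]; };

/* Add to a partition's counter and, unless it is part0, to part0 too. */
void part_stat_add_sectors(struct stat_disk *disk, size_t partno,
                           unsigned long n)
{
    assert(disk != NULL && partno < MAX_PARTS);

    disk->parts[partno].sectors_read += n;
    if (partno != 0)
        disk->parts[0].sectors_read += n;
}
```

Because part0 always receives the update, callers never need a separate whole-disk accounting path, which is the simplification the message describes.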
2008-10-09 | block: kill GENHD_FL_FAIL and use part0->make_it_fail | Tejun Heo
GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and use part0->make_it_fail. Sysfs node handling is unified too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: always set bdev->bd_part | Tejun Heo
Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differentiate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: move holder_dir from disk to part0 | Tejun Heo
Move disk->holder_dir to part0->holder_dir. Kill now mostly superfluous bdev_get_holder(). While at it, kill superfluous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: move policy from disk to part0 | Tejun Heo
Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: unify sysfs size node handling | Tejun Heo
Now that capacity and __dev are moved to part0, part0 and others can share the same method. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: move capacity from disk to part0 | Tejun Heo
Move disk->capacity to part0->nr_sects and convert all users who directly accessed the field to use {get|set}_capacity(). This is done early to allow the __dev field to be moved. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: introduce partition 0 | Tejun Heo
genhd and partition code handled disk and partitions separately. All information about the whole disk was in struct genhd and partitions in struct hd_struct. However, the whole disk (part0) and other partitions have a lot in common and the data structures end up having good number of common fields and thus separate code paths doing the same thing. Also, the partition array was indexed by partno - 1 which gets pretty confusing at times. This patch introduces partition 0 and makes the partition array indexed by partno. Following patches will unify the handling of disk and parts piece-by-piece. This patch also implements disk_partitionable() which tests whether a disk is partitionable. With coming dynamic partition array change, the most common usage of disk_max_parts() will be testing whether a disk is partitionable and the number of max partitions will become much less important. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: implement and use {disk|part}_to_dev() | Tejun Heo
Implement {disk|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk|part}->dev. To make sure no user is left behind, rename generic device fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: implement extended dev numbers | Tejun Heo
Implement extended device numbers. A block driver can tell block layer that it wants to use extended device numbers. After the usual minor space is used up, block layer automatically allocates devt's from EXT_BLOCK_MAJOR. Currently only one major number is allocated for this but as the allocation is strictly on-demand, ~1mil minor space under it should suffice unless the system actually has more than ~1mil partitions and if that ever happens adding more majors to the extended devt area is easy. Due to internal implementation issues, the first partition can't be allocated on the extended area. In other words, genhd->minors should at least be 1. This limitation will be lifted by later changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
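The allocation policy described here (use the disk's own minor range first, then fall back to a shared extended major) might be modeled as follows. The MINORBITS, MAJOR, MINOR and MKDEV macros mirror the kernel's internal 12+20 bit devt encoding, which is where the "~1mil" minors per major comes from. Everything else is an assumption of this sketch: `EXT_MAJOR` and its value, `struct devt_disk`, `alloc_devt()`, and the simple counter used to hand out extended minors in place of the kernel's real allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((uint32_t)(ma) << MINORBITS) | (mi))

#define EXT_MAJOR 259   /* assumed value, for illustration only */

/* Hypothetical subset of the gendisk fields relevant to numbering. */
struct devt_disk {
    unsigned int major, first_minor, minors;
};

/*
 * Pick a device number for partition `partno`. Partitions that fit in
 * the disk's own minor range use it; the rest take the next free minor
 * under the extended major, strictly on demand.
 */
uint32_t alloc_devt(const struct devt_disk *disk, unsigned int partno,
                    unsigned int *next_ext_minor)
{
    assert(disk != NULL && next_ext_minor != NULL);

    if (partno < disk->minors)
        return MKDEV(disk->major, disk->first_minor + partno);

    assert(*next_ext_minor <= MINORMASK);   /* ~1 million minors */
    return MKDEV(EXT_MAJOR, (*next_ext_minor)++);
}
```

Since extended minors are only consumed when a partition actually overflows its disk's range, a single extended major goes a long way, which is the argument the message makes for allocating just one.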
2008-10-09 | block: fix diskstats access | Tejun Heo
There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemption is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disable preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now have @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: fix disk->part[] dereferencing race | Tejun Heo
disk->part[] is protected by its matching bdev's lock. However, non-critical accesses like collecting stats and printing out sysfs and proc information used to be performed without any locking. As partitions can come and go dynamically, partitions can go away underneath those non-critical accesses. As some of those accesses are writes, this theoretically can lead to silent corruption. This patch fixes the race by using RCU for the partition array and dev reference counter to hold partitions.
* Rename disk->part[] to disk->__part[] to make sure no one outside genhd layer proper accesses it directly.
* Use RCU for disk->__part[] dereferencing.
* Implement disk_{get|put}_part() which can be used to get and put partitions from gendisk respectively.
* Iterators are implemented to help iterate through all partitions safely.
* Functions which require RCU readlock are marked with _rcu suffix.
* Use disk_put_part() in __blkdev_put() instead of directly putting the contained kobject.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: don't depend on consecutive minor space | Tejun Heo
* Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer are used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users isn't strictly necessary. However, convert them for consistency.
* Implement disk_max_parts() to avoid directly dereferencing genhd->minors.
* Update bdget_disk() such that it doesn't assume consecutive minor space.
* Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value).
These changes clean up the code and will help disk->part dereference fix and extended block device numbers.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: make variable and argument names more consistent | Tejun Heo
In hd_struct, @partno is used to denote partition number and a number of other places use @part to denote hd_struct. Functions use @part and @index instead. This causes confusion and makes it difficult to use consistent variable names for hd_struct. Always use @partno if a variable represents partition number. Also, print out functions use @f or @part for seq_file argument. Use @seqf uniformly instead. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: update add_partition() error handling | Tejun Heo
d805dda4 tried to fix error case handling in add_partition() but had a few problems.
* disk->part[] entry is set early and left dangling if operation fails.
* Once the device is initialized, the last put_device() is responsible for freeing all the resources. The failure path freed part_stats and p regardless of put_device(), causing double free.
* holders subdir holds reference to the disk device, so failure path should remove it to release resources properly, which was missing.
This patch fixes the above problems and while at it moves partition slot busy check into add_partition() for completeness and inlines holders subdirectory creation. Using separate function for it just obfuscates the code.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Abdel Benamrouche <draconux@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: allow deleting zero length partition | Tejun Heo
delete_partition() was a no-op for zero length partitions. As the addition code allows creating zero length partitions and deletion is assumed to always succeed, this causes a memory leak for zero length partitions. Allow zero length partitions to end their meaningless lives. While at it, allow deleting zero length partitions via BLKPG_DEL_PARTITION ioctl too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | drop vmerge accounting | Mikulas Patocka
Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | block: drop virtual merging accounting | Mikulas Patocka
Remove virtual merge accounting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 | Let the block device know when sectors can be discarded | David Woodhouse
[hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains] Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-02 | mm: tiny-shmem nommu fix | Nick Piggin
The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02 | inotify: fix lock ordering wrt do_page_fault's mmap_sem | Nick Piggin
Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-29 | mm owner: fix race between swapoff and exit | Balbir Singh
There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration.
CPU1: try_to_unuse looks at mm = task0->mm, increments mm->mm_users
CPU0: task 0 exits; mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm); mm_update_next_owner() leaves
CPU1: mmput(mm) decrements mm->mm_users
CPU0: task0 freed
CPU1: dereferencing mm->owner fails
The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL.
Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops.
Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops.
Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm.
Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-29 | Fix NULL pointer dereference in proc_sys_compare | Linus Torvalds
The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function. And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all. And that causes a problem since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare. If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for an unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active. The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like
BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50
and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins).
Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
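Why a NULL inode pointer turns into an address near the top of the address space can be seen from the arithmetic that container_of() performs: it subtracts the member's offset from the member pointer. The sketch below does that subtraction on plain integers. `struct inode_model`, `struct proc_inode_model` and `container_addr()` are hypothetical, and the layout is invented for illustration; the real struct proc_inode has different fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct inode_model { int i_mode; };

/* Invented layout: two pointer-sized fields sit ahead of the inode. */
struct proc_inode_model {
    void *sysctl;
    void *sysctl_entry;
    struct inode_model vfs_inode;   /* embedded inode */
};

/* The address arithmetic behind container_of(), on plain integers. */
uintptr_t container_addr(uintptr_t inode_addr)
{
    size_t off = offsetof(struct proc_inode_model, vfs_inode);

    assert(off > 0);            /* the inode is not the first member */
    return inode_addr - off;    /* wraps around when inode_addr is 0 */
}
```

On a typical 64-bit target the two pointers place vfs_inode at offset 16, so `container_addr(0)` wraps to 0xfffffffffffffff0. Reading a field of the containing structure through such a pointer faults at an address just below zero rather than at a small positive one, which is why the report did not look like an ordinary NULL dereference.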
2008-09-26 | Merge git://oss.sgi.com:8090/xfs/linux-2.6 | Linus Torvalds
* git://oss.sgi.com:8090/xfs/linux-2.6:
  [XFS] Remove xfs_iext_irec_compact_full()
  [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().
2008-09-26 | Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6 | Linus Torvalds
* 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6:
  UBIFS: fix printk format warnings
  UBIFS: remove incorrect assert
  UBIFS: TNC / GC race fixes
  UBIFS: create the name of the background thread in every case
2008-09-26 | [XFS] Remove xfs_iext_irec_compact_full() | Lachlan McIlroy
Yet another bug was found in xfs_iext_irec_compact_full() and while the source of the bug was found, it wasn't an easy task to track it down because the conditions are very difficult to reproduce. A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their significant effort in tracking down the source of this corruption. xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost identical - they both compact indirect extent lists by moving extents from subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only moves extents if all of the extents in the next buffer will fit into the empty space in the buffer before it. xfs_iext_irec_compact_full() will go a step further and move part of the next buffer if all the extents won't fit. It will then shift the remaining extents in the next buffer up to the start of the buffer. The bug here was that we did not update er_extoff and this caused extent list corruption. It does not appear that this extra functionality gains us much. Calling xfs_iext_irec_compact_pages() instead will do a good enough job at compacting the indirect list and will be quicker too. For the case in xfs_iext_indirect_to_direct() the total number of extents in the indirect list will fit into one buffer so we will never need the extra functionality of xfs_iext_irec_compact_full() there. Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the buffers will never overlap) so we don't want the performance hit that can incur. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32166a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
2008-09-26 | [XFS] Fix extent list corruption in xfs_iext_irec_compact_full(). | Lachlan McIlroy
If we don't move all the records from the next buffer into the current buffer then we need to update the er_extoff field of the next buffer as we shift the remaining records to the start of the buffer. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32165a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Russell Cattelan <cattelan@thebarn.com>
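The bookkeeping at issue can be illustrated with a toy model of two adjacent buffers, where each buffer records the list index of its first record. When only part of the next buffer is moved into the current one, the remaining records slide to the front and the next buffer's offset has to advance by the number moved. `struct ext_buf`, its `extoff` field and `move_partial()` are hypothetical simplifications of the XFS indirection array and er_extoff, with plain ints standing in for extent records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAP 4   /* records per buffer; tiny on purpose */

struct ext_buf {
    int recs[BUF_CAP];  /* extent records (plain ints in this sketch) */
    int count;          /* records in use */
    int extoff;         /* list index of recs[0] */
};

/*
 * Move as many records as fit from `next` into the free space of `cur`.
 * Records left behind are shifted to the start of `next`, and
 * next->extoff advances by the number moved. Returns that number.
 */
int move_partial(struct ext_buf *cur, struct ext_buf *next)
{
    int room, moved;

    assert(cur != NULL && next != NULL);
    assert(cur->extoff + cur->count == next->extoff);

    room = BUF_CAP - cur->count;
    moved = next->count < room ? next->count : room;

    memcpy(&cur->recs[cur->count], next->recs, moved * sizeof(int));
    cur->count += moved;

    /* Source and destination overlap here, so memmove is required. */
    memmove(next->recs, &next->recs[moved],
            (next->count - moved) * sizeof(int));
    next->count -= moved;
    next->extoff += moved;   /* keeps cur and next contiguous */

    return moved;
}
```

Leaving out the last assignment would keep `next->extoff` pointing at an index that now lives in `cur`, so later lookups by index would land on the wrong records; this is the kind of inconsistency the fix above addresses.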
2008-09-24 | 9p: use an IS_ERR test rather than a NULL test | Julien Brunel
In case of error, the function p9_client_walk returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@match_bad_null_test@
expression x, E;
statement S1,S2;
@@
x = p9_client_walk(...)
... when != x = E
* if (x != NULL)
S1 else S2
// </smpl>
Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
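For readers unfamiliar with the convention, the following user-space sketch re-creates the ERR_PTR/IS_ERR idea: small negative errno values are encoded in the top of the pointer range, so a function can return either a valid pointer or an error without ever returning NULL. The helpers are simplified re-implementations written for this sketch, and `walk_model()` and `use_walk()` are hypothetical; only the contract (valid pointer or error pointer, never NULL) is taken from the message above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the last MAX_ERRNO addresses. */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int the_fid = 42;   /* stands in for a successfully walked fid */

/* Hypothetical callee: returns a valid pointer or ERR_PTR, never NULL. */
void *walk_model(bool fail)
{
    return fail ? ERR_PTR(-ENOENT) : &the_fid;
}

/* Caller: one IS_ERR test covers every failure; no NULL test follows. */
long use_walk(bool fail)
{
    void *fid = walk_model(fail);

    if (IS_ERR(fid))
        return PTR_ERR(fid);
    return 0;
}
```

Given that contract, a later `if (fid != NULL)` test is always true on the success path, which is why the semantic match flags it for removal.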
2008-09-18 | UBIFS: fix printk format warnings | Alexander Beregalov
fs/ubifs/dir.c:428: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'long unsigned int'
fs/ubifs/debug.c:541: warning: format '%llu' expects type 'long long unsigned int', but argument 2 has type 'long unsigned int'
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
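Warnings of this kind appear when a value whose type differs between architectures is printed with a fixed-width conversion. One common remedy, shown here with snprintf() in user space, is to cast the argument so that it always matches the conversion. `format_ino()` is a hypothetical helper, and whether the patch itself casts the argument or changes the format string is not stated in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format an unsigned long with %llu. Without the cast, compilers warn
 * that the argument type does not match the conversion.
 */
int format_ino(char *buf, size_t len, unsigned long ino)
{
    assert(buf != NULL && len > 0);

    /* The cast makes the argument match %llu on every target. */
    return snprintf(buf, len, "ino %llu", (unsigned long long)ino);
}
```

The alternative of switching the conversion to `%lu` only works if the argument really is an unsigned long on every configuration, so the cast is often the safer choice for types that vary.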
2008-09-17 | UBIFS: remove incorrect assert | Adrian Hunter
The assert was not valid because one of the variables 'taken_empty_lebs' has transient values out of sync with the other variables. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-09-17 | UBIFS: TNC / GC race fixes | Adrian Hunter
- update GC sequence number if any nodes may have been moved even if GC did not finish the LEB
- don't ignore error return when reading
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>