summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-11-06mm/filemap.c: make global sync not clear error status of individual inodesJunichi Nomura
filemap_fdatawait() is a function to wait for on-going writeback to complete but also consume and clear error status of the mapping set during writeback. The latter functionality is critical for applications to detect writeback error with system calls like fsync(2)/fdatasync(2). However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl, which don't check error status of individual mappings. As a result, fsync() may not be able to detect writeback error if events happen in the following order: Application System admin ---------------------------------------------------------- write data on page cache Run sync command writeback completes with error filemap_fdatawait() clears error fsync returns success (but the data is not on disk) This patch adds filemap_fdatawait_keep_errors() for call sites where writeback error is not handled so that they don't clear error status. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: hugetlb: proc: add HugetlbPages field to /proc/PID/statusNaoya Horiguchi
Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb typically want to control their processes on the basis of how much memory (including hugetlb) they use. So this patch simply provides easy access to the info via /proc/PID/status. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smapsNaoya Horiguchi
Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which is inconvenient when we want to know per-task or per-vma base hugetlb usage. To solve this, this patch adds new fields for hugetlb usage like below: Size: 20480 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB AnonHugePages: 0 kB Shared_Hugetlb: 18432 kB Private_Hugetlb: 2048 kB Swap: 0 kB KernelPageSize: 2048 kB MMUPageSize: 2048 kB Locked: 0 kB VmFlags: rd wr mr mw me de ht [hughd@google.com: fix Private_Hugetlb alignment ] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-069p: do not overwrite return code when locking failsDominique Martinet
If the remote locking fail, we run a local vfs unlock that should work and return success to userland when we didn't actually lock at all. We need to tell the application that tried to lock that it didn't get it, not that all went well. Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: clean up unused variable in ocfs2_duplicate_clusters_by_page()Joseph Qi
readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not used, so clean it up. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: add uuid to ocfs2 thread name for problem analysisJoseph Qi
A node can mount multiple ocfs2 volumes. And if thread names are same for each volume/domain, it will bring inconvenience when analyzing problems because we have to identify which volume/domain the messages belong to. Since thread name will be printed to messages, so add volume uuid or dlm name to thread name can benefit problem analysis. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an erroralex chen
In ocfs2_mknod_locked if '__ocfs2_mknod_locke d' returns an error, we should reclaim the inode successfully claimed above, otherwise, the inode never be reused. The case is described below: ocfs2_mknod ocfs2_mknod_locked ocfs2_claim_new_inode Successfully claim the inode __ocfs2_mknod_locked ocfs2_journal_access_di Failed because of -ENOMEM or other reasons, the inode lockres has not been initialized yet. iput(inode) ocfs2_evict_inode ocfs2_delete_inode ocfs2_inode_lock ocfs2_inode_lock_full_nested __ocfs2_cluster_lock Return -EINVAL because of the inode lockres has not been initialized. So the following operations are not performed ocfs2_wipe_inode ocfs2_remove_inode ocfs2_free_dinode ocfs2_free_suballoc_bits Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: fix race between mount and delete node/clusterJoseph Qi
There is a race case between mount and delete node/cluster, which will lead o2hb_thread to malfunctioning dead loop. o2hb_thread { o2nm_depend_this_node(); <<<<<< race window, node may have already been deleted, and then enter the loop, o2hb thread will be malfunctioning because of no configured nodes found. while (!kthread_should_stop() && !reg->hr_unclean_stop && !reg->hr_aborted_start) { } So check the return value of o2nm_depend_this_node() is needed. If node has been deleted, do not enter the loop and let mount fail. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: only take lock if dio entry when recover orphansJoseph Qi
We have no need to take inode mutex, rw and inode lock if it is not dio entry when recover orphans. Optimize it by adding a flag OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: do not include dio entry in case of orphan scanJoseph Qi
dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do not include it when doing normal orphan scan to reduce contention. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: improve performance for localallocJoseph Qi
Currently cluster allocation is always trying to find a victim chain (a chian has most space), and this may lead to poor performance because of discontiguous allocation in some scenarios. Our test case is block size 4k, cluster size 1M and mount option with localalloc=2048 (2G), since a gd is 32256M (about 31.5G) and a localalloc window is only 2G, creating 50G file will result in 2G from gd0, 2G from gd1, ... One way to improve performance is enlarge localalloc window size (max 31104M), but this will make end user feel that about 30G is suddenly "missing", and localalloc currently do not support steal, which means one node cannot use another node's localalloc even it is not used in fact. So using the last gd to record the allocation and continues with the gd if it has enough space for a localalloc window can make the allocation as more contiguous as possible. Our test result is below (evaluated in IOPS), which is using iometer running in VM, dynamic vhd virtual disk stored in ocfs2. IO model Original After Improved(%) 16K60%Write100%Random 703 876 24.59% 8K90%Write100%Random 735 827 12.59% 4K100%Write100%Random 859 915 6.52% 4K100%Read100%Random 2092 2600 24.30% Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Tested-by: Norton Zhu <norton.zhu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2: fill in the unused portion of the block with zeros by dio_zero_block()jiangyiwen
A simplified test case is (this case from Ryan): 1) dd if=/dev/zero of=/mnt/hello bs=512 count=1 oflag=direct; 2) truncate /mnt/hello -s 2097152 file 'hello' is not exist before test. After this command, file 'hello' should be all zero. But 512~4096 is some random data. Setting bh state to new when get a new block, if so, direct_io_worker()->dio_zero_block() will fill-in the unused portion of the block with zero. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ocfs2_direct_IO_write() misses ocfs2_is_overwrite() error codeNorton.Zhu
If ocfs2_is_overwrite failed, ocfs2_direct_IO_write mays till return success to the caller. Signed-off-by: Norton.Zhu <norton.zhu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06logfs: fix build warningSudip Mukherjee
fs/logfs/dev_bdev.c: In function '__bdev_writeseg': include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default] (void) (&_min1 == &_min2); \ fs/logfs/dev_bdev.c:84:14: note: in expansion of macro 'min' max_pages = min(nr_pages, BIO_MAX_PAGES); fs/logfs/dev_bdev.c: In function 'do_erase': include/linux/kernel.h:601:17: warning: comparison of distinct pointer types lacks a cast [enabled by default] (void) (&_min1 == &_min2); \ fs/logfs/dev_bdev.c:174:14: note: in expansion of macro 'min' max_pages = min(nr_pages, BIO_MAX_PAGES); Lets use min_t and mention the type. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06inotify: actually check for invalid bits in sys_inotify_add_watch()Dave Hansen
The comment here says that it is checking for invalid bits. But, the mask is *actually* checking to ensure that _any_ valid bit is set, which is quite different. Without this check, an unexpected bit could get set on an inotify object. Since these bits are also interpreted by the fsnotify/dnotify code, there is the potential for an object to be mishandled inside the kernel. For instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an inotify watch is harmless? Add the actual check which was intended. Retain the existing inotify bits are being added to the watch. Plus, this is existing behavior which would be nice to preserve. I did a quick sniff test that inotify functions and that my 'inotify-tools' package passes 'make check'. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06inotify: hide internal kernel bits from fdinfoDave Hansen
There was a report that my patch: inotify: actually check for invalid bits in sys_inotify_add_watch() broke CRIU. The reason is that CRIU looks up raw flags in /proc/$pid/fdinfo/* to figure out how to rebuild inotify watches and then passes those flags directly back in to the inotify API. One of those flags (FS_EVENT_ON_CHILD) is set in mark->mask, but is not part of the inotify API. It is used inside the kernel to _implement_ inotify but it is not and has never been part of the API. My patch above ensured that we only allow bits which are part of the API (IN_ALL_EVENTS). This broke CRIU. FS_EVENT_ON_CHILD is really internal to the kernel. It is set _anyway_ on all inotify marks. So, CRIU was really just trying to set a bit that was already set. This patch hides that bit from fdinfo. CRIU will not see the bit, not try to set it, and should work as before. We should not have been exposing this bit in the first place, so this is a good patch independent of the CRIU problem. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reported-by: Andrey Wagin <avagin@gmail.com> Acked-by: Andrey Vagin <avagin@openvz.org> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Eric Paris <eparis@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05Merge tag 'driver-core-4.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_*() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove*() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
2015-11-05Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-05Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This is the core block pull request for 4.4. I've got a few more topic branches this time around, some of them will layer on top of the core+drivers changes and will come in a separate round. So not a huge chunk of changes in this round. This pull request contains: - Enable blk-mq page allocation tracking with kmemleak, from Catalin. - Unused prototype removal in blk-mq from Christoph. - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two xchg()'s, from Davidlohr. - A plug flush fix from Jeff. - Also from Jeff, a fix that means we don't have to update shared tag sets at init time unless we do a state change. This cuts down boot times on thousands of devices a lot with scsi/blk-mq. - blk-mq waitqueue barrier fix from Kosuke. - Various fixes from Ming: - Fixes for segment merging and splitting, and checks, for the old core and blk-mq. - Potential blk-mq speedup by marking ctx pending at the end of a plug insertion batch in blk-mq. - direct-io no page dirty on kernel direct reads. - A WRITE_SYNC fix for mpage from Roman" * 'for-4.4/core' of git://git.kernel.dk/linux-block: blk-mq: avoid excessive boot delays with large lun counts blktrace: re-write setting q->blk_trace blk-mq: mark ctx as pending at batch in flush plug path blk-mq: fix for trace_block_plug() block: check bio_mergeable() early before merging blk-mq: check bio_mergeable() early before merging block: avoid to merge splitted bio block: setup bi_phys_segments after splitting block: fix plug list flushing for nomerge queues blk-mq: remove unused blk_mq_clone_flush_request prototype blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write block: kmemleak: Track the page allocations for struct request
2015-11-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There is only one new feature in this pull for the 4.4 merge window, most of it is small enhancements, cleanup and bug fixes: - Add the s390 backend for the software dirty bit tracking. This adds two new pgtable functions pte_clear_soft_dirty and pmd_clear_soft_dirty which is why there is a hit to arch/x86/include/asm/pgtable.h in this pull request. - A series of cleanup patches for the AP bus, this includes the removal of the support for two outdated crypto cards (PCICC and PCICA). - The irq handling / signaling on buffer full in the runtime instrumentation code is dropped. - Some micro optimizations: remove unnecessary memory barriers for a couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for spin_unlock. Use the builtin bswap if available and make test_and_set_bit_lock more cache friendly. - Statistics and a tracepoint for the diagnose calls to the hypervisor. - The CPU measurement facility support to sample KVM guests is improved. - The vector instructions are now always enabled for user space processes if the hardware has the vector facility. This simplifies the FPU handling code. The fpu-internal.h header is split into fpu internals, api and types just like x86. - Cleanup and improvements for the common I/O layer. - Rework udelay to solve a problem with kprobe. udelay has busy loop semantics but still uses an idle processor state for the wait" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits) s390: remove runtime instrumentation interrupts s390/cio: de-duplicate subchannel validation s390/css: unneeded initialization in for_each_subchannel s390/Kconfig: use builtin bswap s390/dasd: fix disconnected device with valid path mask s390/dasd: fix invalid PAV assignment after suspend/resume s390/dasd: fix double free in dasd_eckd_read_conf s390/kernel: fix ptrace peek/poke for floating point registers s390/cio: move ccw_device_stlck functions s390/cio: move ccw_device_call_handler s390/topology: reduce per_cpu() invocations s390/nmi: reduce size of percpu variable s390/nmi: fix terminology s390/nmi: remove casts s390/nmi: remove pointless error strings s390: don't store registers on disabled wait anymore s390: get rid of __set_psw_mask() s390/fpu: split fpu-internal.h into fpu internals, api, and type headers s390/dasd: fix list_del corruption after lcu changes s390/spinlock: remove unneeded serializations at unlock ...
2015-11-03Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - Improvements to expedited grace periods (Paul E McKenney) - Performance improvements to and locktorture tests for percpu-rwsem (Oleg Nesterov, Paul E McKenney) - Torture-test changes (Paul E McKenney, Davidlohr Bueso) - Documentation updates (Paul E McKenney) - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() rcu: Better hotplug handling for synchronize_sched_expedited() rcu: Enable stall warnings for synchronize_rcu_expedited() rcu: Add tasks to expedited stall-warning messages rcu: Add online/offline info to expedited stall warning message rcu: Consolidate expedited CPU selection rcu: Prepare for consolidating expedited CPU selection cpu: Remove try_get_online_cpus() rcu: Stop excluding CPU hotplug in synchronize_sched_expedited() rcu: Stop silencing lockdep false positive for expedited grace periods rcu: Switch synchronize_sched_expedited() to IPI locktorture: Fix module unwind when bad torture_type specified torture: Forgive non-plural arguments rcutorture: Fix unused-function warning for torturing_tasks() rcutorture: Fix module unwind when bad torture_type specified rcu_sync: Cleanup the CONFIG_PROVE_RCU checks locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read() locking/percpu-rwsem: Fix the comments outdated by rcu_sync locking/percpu-rwsem: Make use of the rcu_sync infrastructure locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe ...
2015-11-03Merge branch 'core-debug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull wchan kernel address hiding from Ingo Molnar: "This fixes a wchan related information leak in /proc/PID/stat. There's a bit of an ABI twist to it: instead of setting the wchan field to 0 (which is our usual technique) we set it conditionally to a 0/1 flag to keep ABI compatibility with older procps versions that only fetches /proc/PID/wchan (symbolic names) if the absolute wchan address is nonzero" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/proc, core/debug: Don't expose absolute kernel addresses via wchan
2015-11-02mm: get rid of 'vmalloc_info' from /proc/meminfoLinus Torvalds
It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who *actually* want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-02Merge branch 'fs-file-descriptor-optimization'Linus Torvalds
Merge file descriptor allocation speedup. Eric Dumazet has a test-case for a fairly common network deamon load pattern: openign and closing a lot of sockets that each have very little work done on them. It turns out that in that case, the cost of just finding the correct file descriptor number can be a dominating factor. We've long had a trivial optimization for allocating file descriptors sequentially, but that optimization ends up being not very effective when other file descriptors are being closed concurrently, and the fd patterns are not some simple FIFO pattern. In such cases we ended up spending a lot of time just scanning the bitmap of open file descriptors in order to find the next file descriptor number to open. This trivial patch-series mitigates that by simply introducing a second-level bitmap of which words in the first bitmap are already fully allocated. That cuts down the cost of scanning by an order of magnitude in some pathological (but realistic) cases. The second patch is an even more trivial patch to avoid unnecessarily dirtying the cacheline for the close-on-exec bit array that normally ends up being all empty. * fs-file-descriptor-optimization: vfs: conditionally clear close-on-exec flag vfs: Fix pathological performance case for __alloc_fd()
2015-10-31vfs: conditionally clear close-on-exec flagLinus Torvalds
We clear the close-on-exec flag when opening and closing files, and the bit was almost always already clear before. Avoid dirtying the cacheline if the clearning isn't necessary. That avoids unnecessary cacheline dirtying and bouncing in multi-socket environments. Eric Dumazet has a file descriptor benchmark that goes 4% faster from this on his two-socket machine. It's probably partly superlinear improvement due to getting slightly less spinlock contention on the file_lock spinlock due to less work in the critical section. Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31vfs: Fix pathological performance case for __alloc_fd()Linus Torvalds
Al Viro points out that: > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. And Eric Dumazet has a somewhat realistic multithreaded microbenchmark that opens and closes a lot of sockets with minimal work per socket. This patch largely fixes it. We keep a 2nd-level bitmap of the open file bitmaps, showing which words are already full. So then we can traverse that second-level bitmap to efficiently skip already allocated file descriptors. On his benchmark, this improves performance by up to an order of magnitude, by avoiding the excessive open file bitmap scanning. Tested-and-acked-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs bug fixes from Miklos Szeredi: "This contains fixes for bugs that appeared in earlier kernels (all are marked for -stable)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: free lower_mnt array in ovl_put_super ovl: free stack of paths in ovl_fill_super ovl: fix open in stacked overlay ovl: fix dentry reference leak ovl: use O_LARGEFILE in ovl_copy_up()
2015-10-28fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in ↵Tejun Heo
bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-23Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-23Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more small fixes this week: Qu's fix avoids unneeded COW during fallocate, and Christian found a memory leak in the error handling of an earlier fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix possible leak in btrfs_ioctl_balance() btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-23ocfs2/dlm: unlock lockres spinlock before dlm_lockres_putJoseph Qi
dlm_lockres_put will call dlm_lockres_release if it is the last reference, and then it may call dlm_print_one_lock_resource and take lockres spinlock. So unlock lockres spinlock before dlm_lockres_put to avoid deadlock. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-22btrfs: fix possible leak in btrfs_ioctl_balance()Christian Engelmayer
Commit 8eb934591f8b ("btrfs: check unsupported filters in balance arguments") adds a jump to exit label out_bargs in case the argument check fails. At this point in addition to the bargs memory, the memory for struct btrfs_balance_control has already been allocated. Ownership of bctl is passed to btrfs_balance() in the good case, thus the memory is not freed due to the introduced jump. Make sure that the memory gets freed in any case as necessary. Detected by Coverity CID 1328378. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21block: Inline blk_integrity in struct gendiskMartin K. Petersen
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode sizeQu Wenruo
Current code will always truncate tailing page if its alloc_start is smaller than inode size. For example, the file extent layout is like: 0 4K 8K 16K 32K |<-----Extent A---------------->| |<--Inode size: 18K---------->| But if calling fallocate even for range [0,4K), it will cause btrfs to re-truncate the range [16,32K), causing COW and a new extent. 0 4K 8K 16K 32K |///////| <- Fallocate call range |<-----Extent A-------->|<--B-->| The cause is quite easy, just a careless btrfs_truncate_inode() in a else branch without extra judgment. Fix it by add judgment on whether the fallocate range is beyond isize. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-18debugfs: Add debugfs_create_ulong()Viresh Kumar
Add debugfs_create_ulong() for the users of type 'unsigned long'. These will be 32 bits long on a 32 bit machine and 64 bits long on a 64 bit machine. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18debugfs: Add read-only/write-only bool file opsStephen Boyd
There aren't any read-only or write-only bool file ops, but there is a caller of debugfs_create_bool() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18debugfs: Add read-only/write-only size_t file opsStephen Boyd
There aren't any read-only or write-only size_t file ops, but there is a caller of debugfs_create_size_t() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18debugfs: Add read-only/write-only x64 file opsStephen Boyd
There aren't any read-only or write-only x64 file ops, but there is a caller of debugfs_create_x64() that calls it with mode equal to S_IRUGO. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18debugfs: Consolidate file mode checks in debugfs_create_*()Stephen Boyd
The code that creates debugfs file with different file ops based on the file mode is duplicated in each debugfs_create_*() API. Consolidate that code into debugfs_create_mode(), that takes three file ops structures so that we don't have to keep copy/pasting that logic. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-16Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more bug fixes for btrfs. My commit fixes a bug we hit last week at FB, a combination of lots of hard links and an admin command to resolve inode numbers. Dave is adding checks to make sure balance on current kernels ignores filters it doesn't understand. The penalty for being wrong is just doing more work (not crashing etc), but it's a good fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix use after free iterating extrefs btrfs: check unsupported filters in balance arguments
2015-10-16Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: sh: add copy_user_page() alias for __copy_user() lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE mm, dax: fix DAX deadlocks memcg: convert threshold to bytes builddeb: remove debian/files before build mm, fs: obey gfp_mapping for add_to_page_cache()
2015-10-16mm, dax: fix DAX deadlocksRoss Zwisler
The following two locking commits in the DAX code: commit 843172978bb9 ("dax: fix race between simultaneous faults") commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") introduced a number of deadlocks and other issues which need to be fixed for the v4.3 kernel. The list of issues in DAX after these commits (some newly introduced by the commits, some preexisting) can be found here: https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault"). This undoes most of the changes introduced by those two commits, essentially returning us to the DAX locking scheme that was used in v4.2. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16mm, fs: obey gfp_mapping for add_to_page_cache()Michal Hocko
Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation paths") has caught some users of hardcoded GFP_KERNEL used in the page cache allocation paths. This, however, wasn't complete and there were others which went unnoticed. Dave Chinner has reported the following deadlock for xfs on loop device: : With the recent merge of the loop device changes, I'm now seeing : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073. : : The deadlocked is as follows: : : kloopd1: loop_queue_read_work : xfs_file_iter_read : lock XFS inode XFS_IOLOCK_SHARED (on image file) : page cache read (GFP_KERNEL) : radix tree alloc : memory reclaim : reclaim XFS inodes : log force to unpin inodes : <wait for log IO completion> : : xfs-cil/loop1: <does log force IO work> : xlog_cil_push : xlog_write : <loop issuing log writes> : xlog_state_get_iclog_space() : <blocks due to all log buffers under write io> : <waits for IO completion> : : kloopd1: loop_queue_write_work : xfs_file_write_iter : lock XFS inode XFS_IOLOCK_EXCL (on image file) : <wait for inode to be unlocked> : : i.e. the kloopd, with it's split read and write work queues, has : introduced a dependency through memory reclaim. i.e. that writes : need to be able to progress for reads make progress. : : The problem, fundamentally, is that mpage_readpages() does a : GFP_KERNEL allocation, rather than paying attention to the inode's : mapping gfp mask, which is set to GFP_NOFS. : : The didn't used to happen, because the loop device used to issue : reads through the splice path and that does: : : error = add_to_page_cache_lru(page, mapping, index, : GFP_KERNEL & mapping_gfp_mask(mapping)); This has changed by commit aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC"). This patch changes mpage_readpage{s} to follow gfp mask set for the mapping. There are, however, other places which are doing basically the same. lustre:ll_dir_filler is doing GFP_KERNEL from the function which apparently uses GFP_NOFS for other allocations so let's make this consistent. cifs:readpages_get_pages is called from cifs_readpages and __cifs_readpages_from_fscache called from the same path obeys mapping gfp. ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well regardless it uses mapping_gfp_mask for the page allocation. ext4_mpage_readpages is the called from the page cache allocation path same as read_pages and read_cache_pages As I've noticed in my previous post I cannot say I would be happy about sprinkling mapping_gfp_mask all over the place and it sounds like we should drop gfp_mask argument altogether and use it internally in __add_to_page_cache_locked that would require all the filesystems to use mapping gfp consistently which I am not sure is the case here. From a quick glance it seems that some file system use it all the time while others are selective. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Dave Chinner <david@fromorbit.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Ming Lei <ming.lei@canonical.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-15Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext4 Kconfig description fixup from Jan Kara: "A small fixup in description of EXT4_USE_FOR_EXT2 config option" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext4: Update EXT4_USE_FOR_EXT2 description
2015-10-14mm: add architecture primitives for software dirty bit clearingMartin Schwidefsky
There are primitives to create and query the software dirty bits in a pte or pmd. But the clearing of the software dirty bits is done in common code with x86 specific page table functions. Add the missing architecture primitives to clear the software dirty bits to allow the feature to be used on non-x86 systems, e.g. the s390 architecture. Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14btrfs: fix use after free iterating extrefsChris Mason
The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-10-14btrfs: check unsupported filters in balance argumentsDavid Sterba
We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-13Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol bug" * tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux: svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE nfsd/blocklayout: accept any minlength
2015-10-12writeback: bdi_writeback iteration must not skip dying onesTejun Heo
bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()Tejun Heo
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's; unfortunately, it was always waking up bdi->wb instead of the wb being walked. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's") Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>