summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-07-05Merge tag 'upstream-3.11-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull ubifs fix from Artem Bityutskiy: "Only a single patch which fixes a message" * tag 'upstream-3.11-rc1' of git://git.infradead.org/linux-ubifs: UBIFS: correct mount message
2013-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...
2013-07-04Merge branch 'hpfs' from Mikulas PatockaLinus Torvalds
Merge hpfs patches from Mikulas Patocka. * emailed patches from Mikulas Patocka <mpatocka@artax.karlin.mff.cuni.cz>: hpfs: implement prefetch to improve performance hpfs: use mpage hpfs: better test for errors
2013-07-04hpfs: implement prefetch to improve performanceMikulas Patocka
This patch implements prefetch to improve performance. It helps mostly when scanning the bitmaps to calculate free space. Signed-off-by: Mikulas Patocka <mpatocka@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-04hpfs: use mpageMikulas Patocka
Use the mpage interface to improve performance. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-04hpfs: better test for errorsMikulas Patocka
The test if bitmap access is out of bound could errorneously pass if the device size is divisible by 16384 sectors and we are asking for one bitmap after the end. Check for invalid size in the superblock. Invalid size could cause integer overflows in the rest of the code. Signed-off-by: Mikulas Patocka <mpatocka@artax.karlin.mff.cuni.cz> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-04Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc changes for the 3.11 merge window. In addition to the usual bug fixes and small updates, the main highlights are: - Support for transparent huge pages by Aneesh Kumar for 64-bit server processors. This allows the use of 16M pages as transparent huge pages on kernels compiled with a 64K base page size. - Base VFIO support for KVM on power by Alexey Kardashevskiy - Wiring up of our nvram to the pstore infrastructure, including putting compressed oopses in there by Aruna Balakrishnaiah - Move, rework and improve our "EEH" (basically PCI error handling and recovery) infrastructure. It is no longer specific to pseries but is now usable by the new "powernv" platform as well (no hypervisor) by Gavin Shan. - I fixed some bugs in our math-emu instruction decoding and made it usable to emulate some optional FP instructions on processors with hard FP that lack them (such as fsqrt on Freescale embedded processors). - Support for Power8 "Event Based Branch" facility by Michael Ellerman. This facility allows what is basically "userspace interrupts" for performance monitor events. - A bunch of Transactional Memory vs. Signals bug fixes and HW breakpoint/watchpoint fixes by Michael Neuling. And more ... I appologize in advance if I've failed to highlight something that somebody deemed worth it." * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) pstore: Add hsize argument in write_buf call of pstore_ftrace_call powerpc/fsl: add MPIC timer wakeup support powerpc/mpic: create mpic subsystem object powerpc/mpic: add global timer support powerpc/mpic: add irq_set_wake support powerpc/85xx: enable coreint for all the 64bit boards powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig powerpc/mpic: Add get_version API both for internal and external use powerpc: Handle both new style and old style reserve maps powerpc/hw_brk: Fix off by one error when validating DAWR region end powerpc/pseries: Support compression of oops text via pstore powerpc/pseries: Re-organise the oops compression code pstore: Pass header size in the pstore write callback powerpc/powernv: Fix iommu initialization again powerpc/pseries: Inform the hypervisor we are using EBB regs powerpc/perf: Add power8 EBB support powerpc/perf: Core EBB support for 64-bit book3s powerpc/perf: Drop MMCRA from thread_struct powerpc/perf: Don't enable if we have zero events ...
2013-07-04Merge branch 'akpm' (updates from Andrew Morton)Linus Torvalds
Merge first patch-bomb from Andrew Morton: - various misc bits - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been distracted. There has been quite a bit of activity. - About half the MM queue - Some backlight bits - Various lib/ updates - checkpatch updates - zillions more little rtc patches - ptrace - signals - exec - procfs - rapidio - nbd - aoe - pps - memstick - tools/testing/selftests updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits) tools/testing/selftests: don't assume the x bit is set on scripts selftests: add .gitignore for kcmp selftests: fix clean target in kcmp Makefile selftests: add .gitignore for vm selftests: add hugetlbfstest self-test: fix make clean selftests: exit 1 on failure kernel/resource.c: remove the unneeded assignment in function __find_resource aio: fix wrong comment in aio_complete() drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode drivers/memstick/host/r592.c: convert to module_pci_driver drivers/memstick/host/jmb38x_ms: convert to module_pci_driver pps-gpio: add device-tree binding and support drivers/pps/clients/pps-gpio.c: convert to module_platform_driver drivers/pps/clients/pps-gpio.c: convert to devm_* helpers drivers/parport/share.c: use kzalloc Documentation/accounting/getdelays.c: avoid strncpy in accounting tool aoe: update internal version number to v83 aoe: update copyright date aoe: perform I/O completions in parallel ...
2013-07-03aio: fix wrong comment in aio_complete()Tang Chen
ctx->ctx_lock should be ctx->completion_lock. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/exec.c:de_thread: mt-exec should update ->real_start_timeOleg Nesterov
924b42d5 ("Use boot based time for process start time and boot time in /proc") updated copy_process/do_task_stat but forgot about de_thread(). This breaks "ps axOT" if a sub-thread execs. Note: I think that task->start_time should die. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/exec.c: do_execve_common(): use current_user()Oleg Nesterov
Trivial cleanup. do_execve_common() can use current_user() and avoid the unnecessary "struct cred *cred" var. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/proc/kcore.c: using strlcpy() instead of strncpy()Zhao Hongjiang
For NUL terminated string, set '\0' at the end. Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/proc/uptime.c:uptime_proc_show(): use get_monotonic_boottime()Oleg Nesterov
Change uptime_proc_show() to use get_monotonic_boottime() instead of do_posix_clock_monotonic_gettime() + monotonic_to_bootbased(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/exec.c:de_thread(): use change_pid() rather than detach_pid/attach_pidOleg Nesterov
de_thread() can use change_pid() instead of detach + attach. This looks better and this ensures that, say, next_thread() can never see a task with ->pid == NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: '% at the end' shouldn't bypass core_uses_pid logicOleg Nesterov
"goto end" should not bypass the "Backward compatibility with core_uses_pid" code, move this label up. While at it, - It is ugly to copy '|' into cn->corename and then inc the pointer for argv_split(). Change format_corename() to increment pat_ptr instead. - Remove the dead "if (*pat_ptr == 0)" in format_corename(), we already checked it is not zero. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: kill call_count, add core_name_sizeOleg Nesterov
Imho, "atomic_t call_count" is ugly and should die. It buys nothing and in fact it can grow more than necessary, expand doesn't check if it was already incremented by another task. Kill it, and introduce "static int core_name_size" updated by expand_corename(). This is obviously racy too but harmless, and core_name_size never grows for no reason. We do not bother to to calculate the "right" new size, we simply do kmalloc(size_we_need) and use ksize() to rely on kmalloc_index's decision. Finally change format_corename() to use expand_corename(), krealloc(NULL) is fine. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: kill cn_escape(), introduce cn_esc_printf()Oleg Nesterov
The usage of cn_escape() looks really annoying, imho this sequence needs a wrapper. And it is buggy. If cn_printf() does expand_corename() cn_escape() writes to the freed memory. Introduce cn_esc_printf() which hopefully does this all right. It records the index before cn_vprintf(), not "char *" which is no longer valid (in general) after krealloc(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: cn_vprintf() has no reason to call vsnprintf() twiceOleg Nesterov
cn_vprintf() looks really overcomplicated and sub-optimal. We do not need vsnprintf(NULL) to calculate the size we need, we can simply try to print into the current buffer and expand/retry only if necessary. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: introduce cn_vprintf()Oleg Nesterov
Turn cn_printf(...) into cn_vprintf(va_list args), reintroduce cn_printf() as a trivial wrapper. This simplifies the next change and cn_vprintf() will have more callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03coredump: format_corename() can leak cn->corenameOleg Nesterov
do_coredump() assumes that format_corename() can only fail if expand_corename() fails and frees cn->corename. This is not true, for example cn_print_exe_file() can fail and in this case nobody frees cn->corename. Change do_coredump() to always do kfree(cn->corename) after it calls format_corename() (NULL is fine), change expand_corename() to do nothing if kmalloc() fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Colin Walters <walters@verbum.org> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03signals: eventpoll: do not use sigprocmask()Oleg Nesterov
sigprocmask() should die. None of the current callers actually need this strange interface. Change fs/eventpoll.c to use set_current_blocked(). This also means we should not worry about SIGKILL/SIGSTOP. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs/fat: use fat_msg() to replace printk() in __fat_fs_error()Gu Zheng
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03] nilfs2: use atomic64_t type for inodes_count and blocks_count fields in ↵Vyacheslav Dubeyko
nilfs_root struct The cp_inodes_count and cp_blocks_count are represented as __le64 type in on-disk structure (struct nilfs_checkpoint). But analogous fields in in-core structure (struct nilfs_root) are represented by atomic_t type. This patch replaces atomic_t on atomic64_t type in representation of inodes_count and blocks_count fields in struct nilfs_root. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Joern Engel <joern@logfs.org> Cc: Clemens Eisserer <linuxhippy@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03nilfs2: implement calculation of free inodes countVyacheslav Dubeyko
Currently, NILFS2 returns 0 as free inodes count (f_ffree) and current used inodes count as total file nodes in file system (f_files): df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/loop0 2 2 0 100% /mnt/nilfs2 This patch implements real calculation of free inodes count. First of all, it is calculated total file nodes in file system as (desc_blocks_count * groups_per_desc_block * entries_per_group). Then, it is calculated free inodes count as difference the total file nodes and used inodes count. As a result, we have such output for NILFS2: df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/loop0 4194304 2114701 2079603 51% /mnt/nilfs2 Reported-by: Clemens Eisserer <linuxhippy@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Joern Engel <joern@logfs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03drivers: avoid parsing names as kthread_run() format stringsKees Cook
Calling kthread_run with a single name parameter causes it to be handled as a format string. Many callers are passing potentially dynamic string content, so use "%s" in those cases to avoid any potential accidents. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03clean up scary strncpy(dst, src, strlen(src)) usesKees Cook
Fix various weird constructions of strncpy(dst, src, strlen(src)). Length limits should be about the space available in the destination, not repurposed as a method to either always include or always exclude a trailing NULL byte. Either the NULL should always be copied (using strlcpy), or it should not be copied (using something like memcpy). Readable code should not depend on the weird behavior of strncpy when it hits the length limit. Better to avoid the anti-pattern entirely. [akpm@linux-foundation.org: revert getdelays.c part due to missing bsd/string.h] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [staging] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi] Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ursula Braun <ursula.braun@de.ibm.com> Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: use totalram_pages instead of num_physpages at runtimeJiang Liu
The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec APIMel Gorman
Now that the LRU to add a page to is decided at LRU-add time, remove the misleading lru parameter from __pagevec_lru_add. A consequence of this is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar helpers are misleading as the caller no longer has direct control over what LRU the page is added to. Unused helpers are removed by this patch and existing users of pagevec_lru_add_file() are converted to use lru_cache_add_file() directly and use the per-cpu pagevecs instead of creating their own pagevec. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com> Cc: Andrew Perepechko <anserper@ya.ru> Cc: Robin Dong <sanbai@taobao.com> Cc: Theodore Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Bernd Schubert <bernd.schubert@fastmail.fm> Cc: David Howells <dhowells@redhat.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: support mmap() on /proc/vmcoreHATAYAMA Daisuke
This patch introduces mmap_vmcore(). Don't permit writable nor executable mapping even with mprotect() because this mmap() is aimed at reading crash dump memory. Non-writable mapping is also requirement of remap_pfn_range() when mapping linear pages on non-consecutive physical pages; see is_cow_mapping(). Set VM_MIXEDMAP flag to remap memory by remap_pfn_range and by remap_vmalloc_range_pertial at the same time for a single vma. do_munmap() can correctly clean partially remapped vma with two functions in abnormal case. See zap_pte_range(), vm_normal_page() and their comments for details. On x86-32 PAE kernels, mmap() supports at most 16TB memory only. This limitation comes from the fact that the third argument of remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long. [akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Tested-by: Maxim Uvarov <muvarov@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: calculate vmcore file size from buffer size and total size of vmcore ↵HATAYAMA Daisuke
objects The previous patches newly added holes before each chunk of memory and the holes need to be count in vmcore file size. There are two ways to count file size in such a way: 1) suppose m is a poitner to the last vmcore object in vmcore_list. Then file size is (m->offset + m->size), or 2) calculate sum of size of buffers for ELF header, program headers, ELF note segments and objects in vmcore_list. Although 1) is more direct and simpler than 2), 2) seems better in that it reflects internal object structure of /proc/vmcore. Thus, this patch changes get_vmcore_size_elf{64, 32} so that it calculates size in the way of 2). As a result, both get_vmcore_size_elf{64, 32} have the same definition. Merge them as get_vmcore_size. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: allow user process to remap ELF note segment bufferHATAYAMA Daisuke
Now ELF note segment has been copied in the buffer on vmalloc memory. To allow user process to remap the ELF note segment buffer with remap_vmalloc_page, the corresponding VM area object has to have VM_USERMAP flag set. [akpm@linux-foundation.org: use the conventional comment layout] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: allocate ELF note segment in the 2nd kernel vmalloc memoryHATAYAMA Daisuke
The reasons why we don't allocate ELF note segment in the 1st kernel (old memory) on page boundary is to keep backward compatibility for old kernels, and that if doing so, we waste not a little memory due to round-up operation to fit the memory to page boundary since most of the buffers are in per-cpu area. ELF notes are per-cpu, so total size of ELF note segments depends on number of CPUs. The current maximum number of CPUs on x86_64 is 5192, and there's already system with 4192 CPUs in SGI, where total size amounts to 1MB. This can be larger in the near future or possibly even now on another architecture that has larger size of note per a single cpu. Thus, to avoid the case where memory allocation for large block fails, we allocate vmcore objects on vmalloc memory. This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer to the ELF note segment buffer and its size. There's no longer the vmcore object that corresponds to the ELF note segment in vmcore_list. Accordingly, read_vmcore() has new case for ELF note segment and set_vmcore_list_offsets_elf{64,32}() and other helper functions starts calculating offset from sum of size of ELF headers and size of ELF note segment. [akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: treat memory chunks referenced by PT_LOAD program header entries in ↵HATAYAMA Daisuke
page-size boundary in vmcore_list Treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list. Formally, for each range [start, end], we set up the corresponding vmcore object in vmcore_list to [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)]. This change affects layout of /proc/vmcore. The gaps generated by the rearrangement are newly made visible to applications as holes. Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and [end, roundup(end, PAGE_SIZE)]. Suppose variable m points at a vmcore object in vmcore_list, and variable phdr points at the program header of PT_LOAD type the variable m corresponds to. Then, pictorially: m->offset +---------------+ | hole | phdr->p_offset = +---------------+ m->offset + (paddr - start) | |\ | kernel memory | phdr->p_memsz | |/ +---------------+ | hole | m->offset + m->size +---------------+ where m->offset and m->offset + m->size are always page-size aligned. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: allocate buffer for ELF headers on page-size alignmentHATAYAMA Daisuke
Allocate ELF headers on page-size boundary using __get_free_pages() instead of kmalloc(). Later patch will merge PT_NOTE entries into a single unique one and decrease the buffer size actually used. Keep original buffer size in variable elfcorebuf_sz_orig to kfree the buffer later and actually used buffer size with rounded up to page-size boundary in variable elfcorebuf_sz separately. The size of part of the ELF buffer exported from /proc/vmcore is elfcorebuf_sz. The merged, removed PT_NOTE entries, i.e. the range [elfcorebuf_sz, elfcorebuf_sz_orig], is filled with 0. Use size of the ELF headers as an initial offset value in set_vmcore_list_offsets_elf{64,32} and process_ptload_program_headers_elf{64,32} in order to indicate that the offset includes the holes towards the page boundary. As a result, both set_vmcore_list_offsets_elf{64,32} have the same definition. Merge them as set_vmcore_list_offsets. [akpm@linux-foundation.org: add free_elfcorebuf(), cleanups] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmcore: clean up read_vmcore()HATAYAMA Daisuke
Rewrite part of read_vmcore() that reads objects in vmcore_list in the same way as part reading ELF headers, by which some duplicated and redundant codes are removed. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03fs: nfs: inform the VM about pages being committed or unstableMel Gorman
VM page reclaim uses dirty and writeback page states to determine if flushers are cleaning pages too slowly and that page reclaim should stall waiting on flushers to catch up. Page state in NFS is a bit more complex and a clean page can be unreclaimable due to being unstable which is effectively "dirty" from the perspective of the VM from reclaim context. Similarly, if the inode is currently being committed then it's similar to being under writeback. This patch adds a is_dirty_writeback() handled for NFS that checks if a pages backing inode is being committed and should be accounted as writeback and if a page has private state indicating that it is effectively dirty. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: vmscan: take page buffers dirty and locked state into accountMel Gorman
Page reclaim keeps track of dirty and under writeback pages and uses it to determine if wait_iff_congested() should stall or if kswapd should begin writing back pages. This fails to account for buffer pages that can be under writeback but not PageWriteback which is the case for filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages can have all the buffers clean and writepage does no IO so it should not be accounted as congested. This patch adds an address_space operation that filesystems may optionally use to check if a page is really dirty or really under writeback. An implementation is provided for for buffer_heads is added and used for block operations and ext3 in ordered mode. By default the page flags are obeyed. Credit goes to Jan Kara for identifying that the page flags alone are not sufficient for ext3 and sanity checking a number of ideas on how the problem could be addressed. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFTLibin
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented as a inline funcion vma_pages() in linux/mm.h, so using it. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03pagemap: prepare to reuse constant bits with page-shiftPavel Emelyanov
In order to reuse bits from pagemap entries gracefully, we leave the entries as is but on pagemap open emit a warning in dmesg, that bits 55-60 are about to change in a couple of releases. Next, if a user issues soft-dirty clear command via the clear_refs file (it was disabled before v3.9) we assume that he's aware of the new pagemap format, note that fact and report the bits in pagemap in the new manner. The "migration strategy" looks like this then: 1. existing users are not affected -- they don't touch soft-dirty feature, thus see old bits in pagemap, but are warned and have time to fix themselves 2. those who use soft-dirty know about new pagemap format 3. some time soon we get rid of any signs of page-shift in pagemap as well as this trick with clear-soft-dirty affecting pagemap format. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm: soft-dirty bits for user memory changes trackingPavel Emelyanov
The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03pagemap: introduce pagemap_entry_t without pmshift bitsPavel Emelyanov
These bits are always constant (== PAGE_SHIFT) and just occupy space in the entry. Moreover, in next patch we will need to report one more bit in the pagemap, but all bits are already busy on it. That said, describe the pagemap entry that has 6 more free zero bits. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03clear_refs: introduce private struct for mm_walkPavel Emelyanov
In the next patch the clear-refs-type will be required in clear_refs_pte_range funciton, so prepare the walk->private to carry this info. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03clear_refs: sanitize accepted commands declarationPavel Emelyanov
This is the implementation of the soft-dirty bit concept that should help keep track of changes in user memory, which in turn is very-very required by the checkpoint-restore project (http://criu.org). To create a dump of an application(s) we save all the information about it to files, and the biggest part of such dump is the contents of tasks' memory. However, there are usage scenarios where it's not required to get _all_ the task memory while creating a dump. For example, when doing periodical dumps, it's only required to take full memory dump only at the first step and then take incremental changes of memory. Another example is live migration. We copy all the memory to the destination node without stopping all tasks, then stop them, check for what pages has changed, dump it and the rest of the state, then copy it to the destination node. This decreases freeze time significantly. That said, some help from kernel to watch how processes modify the contents of their memory is required. The proposal is to track changes with the help of new soft-dirty bit this way: 1. First do "echo 4 > /proc/$pid/clear_refs". At that point kernel clears the soft dirty _and_ the writable bits from all ptes of process $pid. From now on every write to any page will result in #pf and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set the soft dirty flag. 2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there (the 55'th one). If set, the respective pte was written to since last call to clear refs. The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by kmemcheck, the latter one marks kernel pages with it, while the former bit is put on user pages so they do not conflict to each other. This patch: A new clear-refs type will be added in the next patch, so prepare code for that. [akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)] Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: fix NULL pointer dereference when traversing o2hb_all_regionsXue jiufei
There may exist NULL pointer dereference in config_item_name() when one volume (say Volume A) unmounts while another (say Volume B) mounting. Volume A Volume B already Mounted. Unmounting, call o2hb_heartbeat_group_drop_item() -> config_item_put(item) set reg(A)->item.ci_name to NULL in function config_item_cleanup(). begin mounting, call o2hb_region_pin() and tranverse all regions. When reading reg(A)->item.ci_name, it causes NULL pointer dereference. call o2hb_region_release() and del reg(A) from list. So we should skip accessing regions that is going to release when tranverse o2hb_all_regions. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: joyce <xuejiufei@huawei.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: adjust switch_case syntax at o2net_state_change()Jie Liu
Adjust switch..case syntax at o2net_state_change to meet the kernel coding standard. s/printk/pr_info/. [akpm@linux-foundation.org: revert pr_foo() change] Signed-off-by: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Gurudas Pai <gurudas.pai@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com> Cc: Srinivas Eeeda <srinivas.eeda@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: Tao Ma <tm@tao.ma> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: fix a comments typo at o2quo_hb_still_up()Jie Liu
Fix a comment typo in o2quo_hb_still_up() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Gurudas Pai <gurudas.pai@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com> Cc: Srinivas Eeeda <srinivas.eeda@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: Tao Ma <tm@tao.ma> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: consolidate o2hb_global_hearbeat_mode_set() naming conventionJie Liu
s/o2hb_global_hearbeat_mode_set/o2hb_global_heartbeat_mode_set/ to make the signature of those routines in a consistent manner with others for heartbeating. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: Gurudas Pai <gurudas.pai@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com> Cc: Srinivas Eeeda <srinivas.eeda@oracle.com> Cc: Tao Ma <tm@tao.ma> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: submit disk heartbeat bio using WRITE_SYNCNoboru Iwamatsu
Under heavy I/O load, writing the disk heartbeat can be forced to wait for minutes, and this causes the node to be fenced. This patch tries to use WRITE_SYNC in submitting the heartbeat bio, so that writing the heartbeat will have a priority over other requests. Signed-off-by: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com> Acked-by: Tao Ma <tm@tao.ma> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: Srinivas Eeeda <srinivas.eeda@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Gurudas Pai <gurudas.pai@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: xattr: fix inlined xattr reflinkJunxiao Bi
Inlined xattr shared free space of inode block with inlined data or data extent record, so the size of the later two should be adjusted when inlined xattr is enabled. See ocfs2_xattr_ibody_init(). But this isn't done well when reflink. For inode with inlined data, its max inlined data size is adjusted in ocfs2_duplicate_inline_data(), no problem. But for inode with data extent record, its record count isn't adjusted. Fix it, or data extent record and inlined xattr may overwrite each other, then cause data corruption or xattr failure. One panic caused by this bug in our test environment is the following: kernel BUG at fs/ocfs2/xattr.c:1435! invalid opcode: 0000 [#1] SMP Pid: 10871, comm: multi_reflink_t Not tainted 2.6.39-300.17.1.el5uek #1 RIP: ocfs2_xa_offset_pointer+0x17/0x20 [ocfs2] RSP: e02b:ffff88007a587948 EFLAGS: 00010283 RAX: 0000000000000000 RBX: 0000000000000010 RCX: 00000000000051e4 RDX: ffff880057092060 RSI: 0000000000000f80 RDI: ffff88007a587a68 RBP: ffff88007a587948 R08: 00000000000062f4 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010 R13: ffff88007a587a68 R14: 0000000000000001 R15: ffff88007a587c68 FS: 00007fccff7f06e0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000015cf000 CR3: 000000007aa76000 CR4: 0000000000000660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process multi_reflink_t Call Trace: ocfs2_xa_reuse_entry+0x60/0x280 [ocfs2] ocfs2_xa_prepare_entry+0x17e/0x2a0 [ocfs2] ocfs2_xa_set+0xcc/0x250 [ocfs2] ocfs2_xattr_ibody_set+0x98/0x230 [ocfs2] __ocfs2_xattr_set_handle+0x4f/0x700 [ocfs2] ocfs2_xattr_set+0x6c6/0x890 [ocfs2] ocfs2_xattr_user_set+0x46/0x50 [ocfs2] generic_setxattr+0x70/0x90 __vfs_setxattr_noperm+0x80/0x1a0 vfs_setxattr+0xa9/0xb0 setxattr+0xc3/0x120 sys_fsetxattr+0xa8/0xd0 system_call_fastpath+0x16/0x1b Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03ocfs2: fix readonly issue in ocfs2_unlink()Younger Liu
While deleting a file with ocfs2_unlink(), there is a bug in this function. This bug will result in filesystem read-only. After calling ocfs2_orphan_add(), the file which will be deleted is added into orphan dir. If ocfs2_delete_entry() fails, the file still exists in the parent dir. And this scenario introduces a conflict of metadata. If a file is added into orphan dir, when we put inode of the file with iput(), the inode i_flags is setted (~OCFS2_VALID_FL) in ocfs2_remove_inode(), and then write back to disk. But as previously mentioned, the file still exists in the parent dir. On other nodes, the file can be still accessed. When first read the file with ocfs2_read_blocks() from disk, It will check and avalidate inode using ocfs2_validate_inode_block(). So File system will be readonly because the inode is invalid. In other words, the inode i_flags has been set (~OCFS2_VALID_FL). [akpm@linux-foundation.org: cleanups] [jeff.liu@oracle.com: s/inode_is_unlinkable/ocfs2_inode_is_unlinkable/] Signed-off-by: Younger Liu <younger.liu@huawei.com> Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>