Age | Commit message | Author |
|
In global heartbeat mode, we have an upper limit for the number of active regions.
This patch adds the facility to track the number of active global heartbeat
regions and fails to start heartbeat if the number exceeds the maximum.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
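The limit check described above can be sketched in user space as follows. This is a minimal illustration, not the patch's code: `MAX_ACTIVE_REGIONS`, `struct hb_regions`, `hb_region_start()` and `hb_region_stop()` are hypothetical names, and both the limit value and the error code are assumptions.

```c
#include <errno.h>

/* Hypothetical limit; the commit message does not state the real maximum. */
#define MAX_ACTIVE_REGIONS 32

/* Tracks how many global heartbeat regions are currently active. */
struct hb_regions {
    unsigned int active;
};

/*
 * Start one more region; refuse once the limit is reached.
 * The -EINVAL return code is an assumption made for this sketch.
 */
int hb_region_start(struct hb_regions *r)
{
    if (r->active >= MAX_ACTIVE_REGIONS)
        return -EINVAL;
    r->active++;
    return 0;
}

/* Stop a region, which frees a slot for a later start. */
void hb_region_stop(struct hb_regions *r)
{
    if (r->active > 0)
        r->active--;
}
```

Keeping the counter next to the start/stop paths means the limit is enforced at the single point where a region becomes active.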
|
|
Currently we track a global livenode bitmap that keeps track of all nodes
that are heartbeating in all regions.
This patch adds the ability to track the livenode bitmap on a per region basis.
We will use this facility in a later patch to allow us to withstand the loss of
a minority number of regions.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
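One way to picture per-region tracking is a bitmap per region, with the global view derived from them. The sketch below assumes the global bitmap is the union of the per-region bitmaps; `NR_REGIONS_SKETCH`, `struct hb_live` and the helper names are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical number of regions for this sketch. */
#define NR_REGIONS_SKETCH 4

/* One bitmap per region: bit n is set while node n heartbeats there. */
struct hb_live {
    uint64_t region[NR_REGIONS_SKETCH];
};

/* Mark a node as heartbeating in one region. */
void hb_node_up(struct hb_live *s, int region, int node)
{
    s->region[region] |= UINT64_C(1) << node;
}

/* Mark a node as no longer heartbeating in one region. */
void hb_node_down(struct hb_live *s, int region, int node)
{
    s->region[region] &= ~(UINT64_C(1) << node);
}

/* Global view (assumed): a node is live if it is live in any region. */
uint64_t hb_global_live(const struct hb_live *s)
{
    uint64_t all = 0;

    for (int i = 0; i < NR_REGIONS_SKETCH; i++)
        all |= s->region[i];
    return all;
}
```

With per-region bitmaps, losing a node in one region does not by itself clear it from the global view, which is the property a later "withstand the loss of a minority of regions" policy would build on.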
|
|
o2hb debugfs handling is reorganized to allow for easy expansion.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
o2hb currently checks slots for configured nodes only. This patch makes
it check the slots for the live nodes too to take care of a race in which
a node is removed from the configuration but not from the live map.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Prints messages when the user adds or removes nodes.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Prints messages when the user adds or removes heartbeat regions in global
heartbeat mode. These messages are useful when debugging cluster related issues.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Adds new dlm message DLM_QUERY_NODEINFO that sends the attributes of all
registered nodes. This message is sent if the negotiated dlm protocol is
1.1 or higher. If the information of the joining node does not match
that of any existing nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
In global heartbeat mode, the heartbeat is started by the user. This patch
prints an error if the user attempts to mount a volume without starting the
heartbeat.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Adds new dlm message DLM_QUERY_REGION that sends the names of all active
heartbeat regions. This message is only sent in the global heartbeat
mode. If the regions in the joining node do not fully match the ones in
the active nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
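The "fully match" rule can be illustrated with a small set comparison. This is only a sketch under the assumption that region names are unique within each list; `regions_match()` and `has_region()` are hypothetical helpers, not the dlm code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if name appears in list[0..n). */
static bool has_region(const char *const *list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0)
            return true;
    return false;
}

/*
 * Two nodes agree only if they have the same number of regions and
 * every local region is also known remotely (names assumed unique).
 */
bool regions_match(const char *const *local, size_t nlocal,
                   const char *const *remote, size_t nremote)
{
    if (nlocal != nremote)
        return false;
    for (size_t i = 0; i < nlocal; i++)
        if (!has_region(remote, nremote, local[i]))
            return false;
    return true;
}
```

The order of names does not matter here; a join would be rejected whenever `regions_match()` returns false.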
|
|
Export function in o2hb to get a list of heartbeat regions. It also adds an
upper limit to the length of the heartbeat region name.
o2hb_global_heartbeat_active() currently disables global heartbeat. It will
be enabled in a later patch after all the code is added.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Add dlm_protocol to the list of info shown by the debugfs file, dlm_state.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.
This patch also adds stackflags to sb->s_cluster_info. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.
This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
|
|
Add heartbeat mode parameter to the configfs tree. This will be used
to set/show the heartbeat mode. The user is free to toggle the mode
between local and global as long as there is no active heartbeat region.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
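The toggle rule above amounts to a guarded setter. The following is a minimal sketch with hypothetical names (`struct cluster_sketch`, `set_hb_mode()`); returning -EBUSY is an assumption, not something the commit message states.

```c
#include <errno.h>

enum hb_mode { HB_MODE_LOCAL, HB_MODE_GLOBAL };

/* Hypothetical cluster state for this sketch. */
struct cluster_sketch {
    enum hb_mode mode;
    unsigned int active_regions;
};

/*
 * Change the heartbeat mode only while no region is heartbeating.
 * The -EBUSY error code is an assumption made for illustration.
 */
int set_hb_mode(struct cluster_sketch *c, enum hb_mode mode)
{
    if (c->active_regions > 0)
        return -EBUSY;
    c->mode = mode;
    return 0;
}
```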
|
|
The BKL in ocfs2/dlmfs is used in put_super, fill_super and remount_fs,
all three of which are protected by the superblock's s_umount rw_semaphore.
The use in ocfs2_control_open is evidently unrelated, and that function
is protected by ocfs2_control_lock.
Therefore it is safe to remove the BKL entirely.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
|
|
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still use the BKL.
I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.
do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both of the latter functions are actually the filesystem's
follow_link inode operation. vfs_kern_mount() calls the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.
Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems' get_sb/fill_super operations.
[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]
Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
|
|
ocfs2 fast symlinks are NUL terminated strings stored inline in the
inode data area. However, disk corruption or a local attacker could, in
theory, remove that NUL. Because we're using strlen() (my fault,
introduced in a731d1 when removing vfs_follow_link()), we could walk off
the end of that string.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
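The difference between an unbounded and a bounded length check can be sketched as below. This is a user-space illustration, not the kernel fix itself: `fast_symlink_len()` is a hypothetical helper, and `memchr()` stands in for whatever bounded helper the real code uses.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the inline symlink target, or -1 if no NUL
 * terminator is found within the max bytes of the inode data area.
 * Unlike strlen(), this never reads past data[max - 1].
 */
long fast_symlink_len(const char *data, size_t max)
{
    const char *nul = memchr(data, '\0', max);

    if (nul == NULL)
        return -1;
    return (long)(nul - data);
}
```

A caller can treat the -1 case as corruption instead of walking off the end of the buffer.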
|
|
While umounting, a block mle doesn't get freed if dlm is shut down after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.
This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
We keep our inode flags in sync with ext2 and define them as hex
values. However, commit 3669567 (4 years ago) moved all of these
values to include/linux/fs.h. So we should use them the way ext2
does and define our inode flags in terms of FS_*.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The first time I read ocfs2_resmap_resv_bits, I wondered what
'wanted' was used for and puzzled over the comments.
Then I found that it is only used if the reservation is empty. ;)
So we'd better move it inside the parens, which makes the code more
readable. What's more, ocfs2_resmap_resv_bits is called so frequently
that we should save some CPU cycles.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
e_leaf_clusters is a le16, so use cpu_to_le16 instead
of cpu_to_le32.
What's more, we change 'clusters' to unsigned int to
signify that the size of 'clusters' isn't important here.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
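Why the conversion width matters can be shown by simulating a big-endian host, where cpu_to_le16/cpu_to_le32 are byte swaps. The sketch below is an illustration only; `swap16()`, `swap32()` and the two `store_*` helpers are hypothetical names, not kernel functions.

```c
#include <stdint.h>

/* Byte swaps, i.e. what cpu_to_le16/32 do on a big-endian host. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t swap32(uint32_t v)
{
    return (v << 24) | ((v & 0xff00u) << 8) |
           ((v >> 8) & 0xff00u) | (v >> 24);
}

/* Wrong: a 32-bit swap truncated into a 16-bit field loses the value. */
uint16_t store_with_le32(uint32_t clusters)
{
    return (uint16_t)swap32(clusters);
}

/* Right: swap at the width of the field being written. */
uint16_t store_with_le16(uint32_t clusters)
{
    return swap16((uint16_t)clusters);
}
```

For a small value such as 5, the 32-bit swap moves the significant byte into the top of the word, so the 16-bit field ends up as zero. On little-endian hosts both conversions are no-ops, which is why such a mismatch can go unnoticed there.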
|
|
ext3 fixed this in commit 30e2bab: ctime was not updated when an ACL was set. So change ocfs2 accordingly.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1283760364
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1283760364
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Fix various typos of valid.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
mmotm/fs/ocfs2/cluster/tcp.c: In function ‘o2net_send_message_vec’:
mmotm/fs/ocfs2/cluster/tcp.c:980:6: warning: ‘ret’ may be used uninitialized in this function
This seems to be a real bug introduced by commit 9af0b38ff3 (ocfs2/net:
Use wait_event() in o2net_send_message_vec()).
cc: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
When CONFIG_OCFS2_DEBUG_MASKLOG is undefined, we don't use the dentry
variable in ocfs2_sync_file(). Let's just move all access to the dentry
inside the logging call.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
This patch tries to handle the case in which list 'dlm->tracking_list' is
empty, to avoid accessing an invalid pointer. It fixes the following oops:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
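The hazard with an empty circular list is that the head's next pointer refers back to the head itself, so treating it as an entry yields an invalid object. The sketch below mimics that layout in user space; `struct list_node` and its helpers are simplified stand-ins written for this illustration, not the kernel's list API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, similar in shape to the kernel's. */
struct list_node {
    struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

void list_add_tail_sketch(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* An empty list is one whose head points back at itself. */
bool list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Return the first real entry, or NULL instead of the head itself. */
struct list_node *list_first_or_null(struct list_node *head)
{
    return list_is_empty(head) ? NULL : head->next;
}
```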
|
|
In ocfs2_dx_dir_rebalance(), we need to rejournal_access the blocks after
calling ocfs2_insert_extent() since growing an extent tree may trigger
ocfs2_extend_trans(), which makes previous journal_access meaningless.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
This patch changes mutex_lock to a new subclass and
adds a new inode lock subclass for the target inode,
whose locking caused this lockdep warning.
=============================================
[ INFO: possible recursive locking detected ]
2.6.35+ #5
---------------------------------------------
reflink/11086 is trying to acquire lock:
(Meta){+++++.}, at: [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
but task is already holding lock:
(Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
other info that might help us debug this:
6 locks held by reflink/11086:
#0: (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [<ffffffff820e09ec>] lookup_create+0x26/0x97
#1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa06f99a0>] ocfs2_reflink_ioctl+0x4d3/0x1229 [ocfs2]
#2: (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
#3: (&oi->ip_xattr_sem){+.+.+.}, at: [<ffffffffa06f9b58>] ocfs2_reflink_ioctl+0x68b/0x1229 [ocfs2]
#4: (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa06f9b67>] ocfs2_reflink_ioctl+0x69a/0x1229 [ocfs2]
#5: (&sb->s_type->i_mutex_key#15/2){+.+...}, at: [<ffffffffa06f9d4f>] ocfs2_reflink_ioctl+0x882/0x1229 [ocfs2]
stack backtrace:
Pid: 11086, comm: reflink Not tainted 2.6.35+ #5
Call Trace:
[<ffffffff82063dd9>] validate_chain+0x56e/0xd68
[<ffffffff82062275>] ? mark_held_locks+0x49/0x69
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06c9ade>] __ocfs2_cluster_lock+0x975/0xa0d [ocfs2]
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06e107b>] ? ocfs2_wait_for_recovery+0x15/0x8a [ocfs2]
[<ffffffffa06cb6ea>] ocfs2_inode_lock_full_nested+0x1ac/0xdc5 [ocfs2]
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffff820623a0>] ? trace_hardirqs_on_caller+0x10b/0x12f
[<ffffffff82060193>] ? debug_mutex_free_waiter+0x4f/0x53
[<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06ce24a>] ? ocfs2_file_lock_res_init+0x66/0x78 [ocfs2]
[<ffffffff820bb2d2>] ? might_fault+0x40/0x8d
[<ffffffffa06df9f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
[<ffffffff820ee5d3>] ? mntput_no_expire+0x1d/0xb0
[<ffffffff820e07b3>] ? path_put+0x2c/0x31
[<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
[<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
[<ffffffff8233a7f6>] ? _raw_spin_unlock+0x26/0x2a
[<ffffffff8200299c>] ? sysret_check+0x27/0x62
[<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
As the name shows, we shouldn't take any lock in
ocfs2_xattr_get_nolock. So lift ip_xattr_sem to the caller.
This should be safe for us since the only 2 callers are:
1. ocfs2_xattr_get, which will lock the resources.
2. ocfs2_mknod, which doesn't need this locking.
This also resolves the following lockdep warning.
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.35+ #5
-------------------------------------------------------
reflink/30027 is trying to acquire lock:
(&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]
but task is already holding lock:
(&oi->ip_xattr_sem){++++..}, at: [<ffffffffa0673b58>] ocfs2_reflink_ioctl+0x68b/0x1226 [ocfs2]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&oi->ip_xattr_sem){++++..}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffff82339650>] down_read+0x34/0x47
[<ffffffffa0691cb8>] ocfs2_xattr_get_nolock+0xa0/0x4e6 [ocfs2]
[<ffffffffa069d64f>] ocfs2_get_acl_nolock+0x5c/0x132 [ocfs2]
[<ffffffffa069d9c7>] ocfs2_init_acl+0x60/0x243 [ocfs2]
[<ffffffffa066499d>] ocfs2_mknod+0xae8/0xfea [ocfs2]
[<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
[<ffffffff820e1c83>] vfs_create+0x9b/0xf4
[<ffffffff820e20bb>] do_last+0x2fd/0x5be
[<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
[<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
[<ffffffff820d6dac>] sys_open+0x1b/0x1d
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #2 (jbd2_handle){+.+...}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffffa0604ff8>] start_this_handle+0x4a3/0x4bc [jbd2]
[<ffffffffa06051d6>] jbd2__journal_start+0xba/0xee [jbd2]
[<ffffffffa0605218>] jbd2_journal_start+0xe/0x10 [jbd2]
[<ffffffffa065ca34>] ocfs2_start_trans+0xb7/0x19b [ocfs2]
[<ffffffffa06645f3>] ocfs2_mknod+0x73e/0xfea [ocfs2]
[<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
[<ffffffff820e1c83>] vfs_create+0x9b/0xf4
[<ffffffff820e20bb>] do_last+0x2fd/0x5be
[<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
[<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
[<ffffffff820d6dac>] sys_open+0x1b/0x1d
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #1 (&journal->j_trans_barrier){.+.+..}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82064fa9>] lock_release_non_nested+0x1e5/0x24b
[<ffffffff82065999>] lock_release+0x158/0x17a
[<ffffffff823389f6>] __mutex_unlock_slowpath+0xbf/0x11b
[<ffffffff82338a5b>] mutex_unlock+0x9/0xb
[<ffffffffa0679673>] ocfs2_free_ac_resource+0x31/0x67 [ocfs2]
[<ffffffffa067c6bc>] ocfs2_free_alloc_context+0x11/0x1d [ocfs2]
[<ffffffffa0633de0>] ocfs2_write_begin_nolock+0x141e/0x159b [ocfs2]
[<ffffffffa0635523>] ocfs2_write_begin+0x11e/0x1e7 [ocfs2]
[<ffffffff820a1297>] generic_file_buffered_write+0x10c/0x210
[<ffffffffa0653624>] ocfs2_file_aio_write+0x4cc/0x6d3 [ocfs2]
[<ffffffff820d822d>] do_sync_write+0xc2/0x106
[<ffffffff820d897b>] vfs_write+0xae/0x131
[<ffffffff820d8e55>] sys_write+0x47/0x6f
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #0 (&oi->ip_alloc_sem){+.+.+.}:
[<ffffffff82063f92>] validate_chain+0x727/0xd68
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffff82339694>] down_write+0x31/0x52
[<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]
[<ffffffffa06599f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
[<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
[<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
[<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Track negative dentries by recording the generation number of the parent
directory in d_fsdata. The generation number for the parent directory is
recorded in the inode_info, which increments every time the lock on the
directory is dropped.
If the generation numbers of the parent directory and the negative dentry
match, there is no need to perform the revalidate; otherwise a revalidate
is forced. This improves performance in situations where nodes look for
the same non-existent file multiple times.
Thanks Mark for explaining the DLM sequence.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
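The generation scheme above can be sketched with two counters. This is a simplified user-space model with hypothetical names (`struct dir_sketch`, `struct neg_dentry_sketch`); in the patch the values live in the inode_info and in d_fsdata.

```c
#include <stdbool.h>

/* Parent directory: generation bumps each time its lock is dropped. */
struct dir_sketch {
    unsigned long gen;
};

/* Negative dentry: remembers the parent's generation at lookup time. */
struct neg_dentry_sketch {
    unsigned long parent_gen;
};

void dir_lock_dropped(struct dir_sketch *d)
{
    d->gen++;
}

void neg_dentry_record(struct neg_dentry_sketch *n, const struct dir_sketch *d)
{
    n->parent_gen = d->gen;
}

/* A revalidate is only forced when the generations differ. */
bool neg_dentry_needs_revalidate(const struct neg_dentry_sketch *n,
                                 const struct dir_sketch *d)
{
    return n->parent_gen != d->gen;
}
```

As long as the directory lock has not been dropped since the failed lookup, repeated lookups of the same missing name can skip the revalidate.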
|
|
During orphan scan, if we are slot 0, and we are replaying
orphan_dir:0001, the general process is that for every file
in this dir:
1. we will iget orphan_dir:0001, since there is no inode for it.
we will have to create an inode and read it from the disk.
2. do the normal work, such as delete_inode and remove it from
the dir if it is allowed.
3. call iput orphan_dir:0001 when we are done. In this case,
since we have no dcache for this inode, i_count will
reach 0, and VFS will have to call clear_inode and in
ocfs2_clear_inode we will checkpoint the inode which will let
ocfs2_cmt and journald begin to work.
4. We loop back to 1 for the next file.
So you see, for every deleted file, we actually have to read the
orphan dir from the disk and checkpoint the journal. This is very
time consuming and causes a lot of journal checkpoint I/O.
A better solution is to hold another reference to these
inodes in ocfs2_super. Then, if there is no other race among
nodes (which would make dlmglue checkpoint the inode), clear_inode
won't be called in step 3, and in step 1 we may only need to
read the inode the first time. This is a big win for us.
So this patch will try to cache system inodes of other slots so
that we will have one more reference for these inodes and avoid
the extra inode read and journal checkpoint.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB. But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.
This patch replaces that sanity check with a more sophisticated one to
mount a huge volume provided that (a) it is addressable by the raw
word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.
I factored out the sanity check into its own function. I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.
This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").
I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal. All of them appear
to work or to fail gracefully, as appropriate.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
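The three conditions (a) to (c) can be combined into one predicate, sketched below. This is not the code in fs/ocfs2/super.c: `struct vol_sketch`, `huge_volume_ok()`, the `index_bits` parameter that models the host word size, and the "more than 32 bits of block number" threshold for a huge volume are all assumptions made for illustration. It also assumes the block size does not exceed the page size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what the mount path knows about the volume. */
struct vol_sketch {
    uint64_t blocks;              /* total blocks on the volume */
    unsigned int blocksize_bits;  /* log2 of the block size */
    bool uses_jbd2;
    bool journal_64bit;           /* JBD2_FEATURE_INCOMPAT_64BIT set */
};

/*
 * page_shift: log2 of the page size (assumed >= blocksize_bits).
 * index_bits: width of the host's page index type (32 or 64).
 */
bool huge_volume_ok(const struct vol_sketch *v,
                    unsigned int page_shift, unsigned int index_bits)
{
    uint64_t max_index = index_bits >= 64 ?
        UINT64_MAX : (UINT64_C(1) << index_bits) - 1;
    uint64_t last_page = v->blocks >> (page_shift - v->blocksize_bits);

    /* (a) the volume must be addressable on this host */
    if (last_page > max_index)
        return false;
    /* small volumes need nothing further (threshold is an assumption) */
    if (v->blocks <= UINT32_MAX)
        return true;
    /* (b) and (c): JBD2 with 64-bit block numbers in the journal */
    return v->uses_jbd2 && v->journal_64bit;
}
```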
|
|
merge-2
|
|
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has
already been written before calling ocfs2_sync_file() and ocfs2 doesn't use
inode's private_list for tracking metadata buffers, thus sync_mapping_buffers()
is superfluous as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Thanks for the comments. I have incorporated them all.
CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
Statistics now look like -
ocfs2_write_ctxt: 2144 - 2136 = 8
ocfs2_inode_info: 1960 - 1848 = 112
ocfs2_journal: 168 - 160 = 8
ocfs2_lock_res: 336 - 304 = 32
ocfs2_refcount_tree: 512 - 472 = 40
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
In ocfs2, we don't actually allow any direct write past i_size;
see the function ocfs2_prepare_inode_for_write. So we don't
need the bogus simple_setsize.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Currently the orphan scan worker has no trace log, so it is
very hard to tell whether it has finished or is blocked.
So add 2 mlog trace messages so that we can tell whether
the current orphan scan worker is blocked or not.
It did help when I analyzed an orphan scan bug.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
We need this ioctl to give non-privileged end users a way to gather
filesystem information.
The new ioctl is OCFS2_IOC_INFO. Userspace passes the kernel a
structure containing an array of request pointers and a request
count, such as,
* From userspace:
    struct ocfs2_info_blocksize oib = {
            .ib_req = {
                    .ir_magic = OCFS2_INFO_MAGIC,
                    .ir_code = OCFS2_INFO_BLOCKSIZE,
                    ...
            }
            ...
    };
    struct ocfs2_info_clustersize oic = {
            ...
    };
    uint64_t reqs[2] = {(unsigned long)&oib,
                        (unsigned long)&oic};
    struct ocfs2_info info = {
            .oi_requests = reqs,
            .oi_count = 2,
    };
    ret = ioctl(fd, OCFS2_IOC_INFO, &info);
* In kernel:
Get the request pointers from *info*, then handle each request one by one.
The idea here is to keep each separate request small enough to guarantee
better backward and forward compatibility, since a small request
is less likely to break if the filesystem's on-disk format changes.
Currently, the following 7 requests are supported per the requirements of the
userspace tool o2info, and I believe the list will grow over time :-)
OCFS2_INFO_CLUSTERSIZE
OCFS2_INFO_BLOCKSIZE
OCFS2_INFO_MAXSLOTS
OCFS2_INFO_LABEL
OCFS2_INFO_UUID
OCFS2_INFO_FS_FEATURES
OCFS2_INFO_JOURNAL_SIZE
This ioctl is only specific to OCFS2.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_create_inode_in_orphan() is used by reflink to create the newly
reflinked inode simultaneously in the orphan dir. This allows us to easily
handle partially-reflinked files during recovery cleanup.
We have a problem though - the orphan dir stringifies inode # to determine
a unique name under which the orphan entry dirent can be created. Since
ocfs2_create_inode_in_orphan() needs the space allocated in the orphan dir
before it can allocate the inode, we currently call into the orphan code:
	/*
	 * We give the orphan dir the root blkno to fake an orphan name,
	 * and allocate enough space for our insertion.
	 */
	status = ocfs2_prepare_orphan_dir(osb, &orphan_dir,
					  osb->root_blkno,
					  orphan_name, &orphan_insert);
Using osb->root_blkno might work fine on unindexed directories, but the
orphan dir can have an index. When it has that index, the above code fails
to allocate the proper index entry. Later, when we try to remove the file
from the orphan dir (using the actual inode #), the reflink operation will
fail.
To fix this, I created a function ocfs2_alloc_orphaned_file() which uses the
newly split out orphan and inode alloc code to figure out what the inode
block number will be (once allocated) and then prepare the orphan dir from
that data.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
We do this because ocfs2_create_inode_in_orphan() wants to order locking of
the orphan dir with respect to locking of the inode allocator *before*
making any changes to the directory.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
This is for code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is placed
- that code is actually in ocfs2_mknod_locked. This is a problem when the
orphan dirs are indexed as the junk inode number will create an index entry
which goes unused (and fails the later removal from the orphan dir). Now
with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
ocfs2_search_chain() makes the same updates as
ocfs2_alloc_dinode_update_counts to the alloc inode. Instead of open coding
the bitmap update, use our helper function.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
Do this by splitting the bulk of the function away from the inode allocation
code at the very top of ocfs2_mknod_locked(). Existing callers don't need to
change and won't see any difference. The new function created,
__ocfs2_mknod_locked() will be used shortly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
This patch fixes a regression introduced by commit 6b933c8... ('ocfs2:
Avoid direct write if we fall back to buffered I/O'):
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1285
Commit 6b933c8e6f1a2f3118082c455eef25f9b1ac7b45 changed __generic_file_aio_write
to generic_file_buffered_write, which doesn't call filemap_{write,wait}_range to flush
the page cache when O_DIRECT writes fall back to buffered ones. This hurt
O_DIRECT semantics for extended O_DIRECT writes.
This patch tries to guarantee that O_DIRECT writes which fall back to buffered I/O are
correctly flushed.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
We cannot call grab_cache_page() when holding filesystem locks or with
a transaction started as grab_cache_page() calls page allocation with
GFP_KERNEL flag and thus page reclaim can recurse back into the filesystem
causing deadlocks or various assertion failures. We have to use
find_or_create_page() instead and pass it GFP_NOFS as we do with other
allocations.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
We were setting ac->ac_last_group in ocfs2_claim_suballoc_bits from
res->sr_bg_blkno. Unfortunately, res->sr_bg_blkno is going to be zero under
normal (non-fragmented) circumstances. The discontig block group patches
effectively turned off that feature. Fix this by correctly calculating what
the next group hint should be.
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|
|
We have added discontig block groups, so an inode
can now be allocated in a discontig block group. So get
it in ocfs2_get_suballoc_slot_bit.
The old ocfs2_test_suballoc_bit gets the group block number
from the allocation inode, which is wrong. Fix it by
passing the right group.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
|