Description of problem:
I have a 3 node cluster running ccsd/cman/dlm/clvmd/gfs. Mount a
single GFS on each of the three nodes. Traffic does not need to be
running. On nodeC run 'gfs_tool withdraw /mnt/gfs1'. Then 'umount
/mnt/gfs1'. The node is happy and the GFS can be remounted.
On nodeB run 'gfs_tool withdraw /mnt/gfs1' then 'mount' then 'umount
/mnt/gfs1' and get the following panic:
GFS: fsid=MILTON:data1.1: withdrawing from cluster at user's request
GFS: fsid=MILTON:data1.1: about to withdraw from the cluster
GFS: fsid=MILTON:data1.1: waiting for outstanding I/O
GFS: fsid=MILTON:data1.1: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=MILTON:data1.1: withdrawn
[<e0483a32>] gfs_assert_i+0x48/0x69 [gfs]
[<e04686b7>] disk_commit+0xdd/0x260 [gfs]
[<e04695a2>] gfs_log_dump+0x3c5/0x550 [gfs]
[<e048132f>] gfs_make_fs_ro+0x44/0x80 [gfs]
[<e04772e3>] gfs_put_super+0x263/0x331 [gfs]
[<e0474d6e>] gfs_kill_sb+0x1f/0x43 [gfs]
Kernel panic - not syncing: GFS: fsid=MILTON:data1.1: assertion
"gfs_log_is_header(sdp, tr->tr_first_head)" failed
GFS: fsid=MILTON:data1.1: function = disk_commit
GFS: fsid=MILTON:data1.1: file =
line = 734
GFS: fsid=MILTON:data1.1: time = 1107204803
Version-Release number of selected component (if applicable):
RPMs from cluster-i686-2005-01-28-1544.tar
Steps to Reproduce:
'gfs_tool withdraw' followed by 'umount /gfs/mntpoint' on the first node doesn't
kill that node; repeating the same sequence on a 2nd node panics it.
Adding to blocker list for release
This should be fixed now.
Ran the same test on the build from 2005-02-10-1033 and got the same results.
As a side note: with lock_gulm, you cannot remount a fs without first doing rmmod
lock_gulm. (I.e., mount, withdraw, umount, mount will cause problems with
lock_gulm, whereas mount, withdraw, umount, rmmod, mount should work.)
I know this bug is with dlm, but sooner or later someone will try it with gulm.
Comment #4 was made before I realized that this bug was about multiple nodes.
Definitely a gfs-ism; you get the same backtrace with dlm or gulm.
I was under the impression that once withdrawn, gfs would no longer try to write to the device, yet
here it looks like it is trying to do exactly that. Could be I'm just missing something; going to read more code...
A: gfs_tool withdraw /mnt
B: gfs_tool withdraw /mnt
B: umount /mnt
You don't need to umount the first node, just withdraw it.
Well, if you set the ROFS bit along with the SHUTDOWN bit, it kind of looks like things work. But I don't
quite believe that is really the answer; more prodding required.
Comment #8 is definitely not the answer; it causes other oopses.
Three nodes, no I/O. (There must be no I/O.)
A: reboot -fn
wait for others to replay A's journal.
B: gfs_tool withdraw /mnt
B: umount /mnt
So: a withdraw *after* a node has replayed some other node's journal, followed
by an umount, causes the oops.
What appears to be happening is that the other nodes give up the transaction
lock so one node can replay a journal. In doing so, they close out their own
journals. Now, *any* activity other than withdraw causes the trans lock to be
reheld and the journal to be reopened, so all is good. However, a withdraw does
not rehold the trans lock, but still tries to calculate log entries.
Since it didn't reopen the journal, the calculations are off, and the assert fires.
No idea how to fix it yet, but I think I've finally found what causes the problem.
Think I've got a fix.
If a withdraw is called before we've had a chance to relock the trans
lock, sd_log_head points to the wrong place, and an umount will
fail on asserts because of this.
Adding one puts sd_log_head at a value that passes the assert. The
value may not be correct for what is on disk, but we've withdrawn, so there is
no more disk I/O.
If we're not withdrawn, the next I/O will grab the trans lock, which
will fill sd_log_head with the correct value.
Verified in gfs_tool 6.1-0.pre19.