Bug 146711 - gfs_tool withdraw: 2nd node withdrawn and umounted panics.
Status: CLOSED NEXTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: michael conrad tadpol tilstra
QA Contact: GFS Bugs
Depends On:
Blocks: 144795
Reported: 2005-01-31 17:18 EST by Derek Anderson
Modified: 2010-01-11 22:02 EST (History)
2 users (show)

Doc Type: Bug Fix
Last Closed: 2005-03-08 13:03:43 EST


Attachments: None
Description Derek Anderson 2005-01-31 17:18:19 EST
Description of problem:
I have a 3 node cluster running ccsd/cman/dlm/clvmd/gfs.  Mount a
single GFS on each of the three nodes.  Traffic does not need to be
running.  On nodeC run 'gfs_tool withdraw /mnt/gfs1'.  Then 'umount
/mnt/gfs1'.  The node is happy and the GFS can be remounted.

On nodeB run 'gfs_tool withdraw /mnt/gfs1' then 'mount' then 'umount
/mnt/gfs1' and get the following panic:

GFS: fsid=MILTON:data1.1: withdrawing from cluster at user's request
GFS: fsid=MILTON:data1.1: about to withdraw from the cluster
GFS: fsid=MILTON:data1.1: waiting for outstanding I/O
GFS: fsid=MILTON:data1.1: telling LM to withdraw
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: withdrawing
from cluster at user's request
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: about to
withdraw from the cluster
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: waiting for
outstanding I/O
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: telling LM
to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=MILTON:data1.1: withdrawn
Jan 31 15:53:03 link-12 kernel: lock_dlm: withdraw abandoned memory
Jan 31 15:53:03 link-12 kernel: GFS: fsid=MILTON:data1.1: withdrawn
 [<e0483a32>] gfs_assert_i+0x48/0x69 [gfs]
 [<e04686b7>] disk_commit+0xdd/0x260 [gfs]
 [<e04695a2>] gfs_log_dump+0x3c5/0x550 [gfs]
 [<e048132f>] gfs_make_fs_ro+0x44/0x80 [gfs]
 [<e04772e3>] gfs_put_super+0x263/0x331 [gfs]
 [<c0168af1>] generic_shutdown_super+0x119/0x2eb
 [<e0474d6e>] gfs_kill_sb+0x1f/0x43 [gfs]
 [<c01687c5>] deactivate_super+0xc5/0xda
 [<c01833f0>] sys_umount+0x65/0x6c
 [<c01549dc>] unmap_vma_list+0xe/0x17
 [<c0154d8b>] do_munmap+0x1c8/0x1d2
 [<c0183402>] sys_oldumount+0xb/0xe
 [<c0301bfb>] syscall_call+0x7/0xb
Kernel panic - not syncing: GFS: fsid=MILTON:data1.1: assertion
"gfs_log_is_header(sdp, tr->tr_first_head)" failed
GFS: fsid=MILTON:data1.1:   function = disk_commit
GFS: fsid=MILTON:data1.1:   file =
/usr/src/build/512195-i686/BUILD/gfs-kernel-2.6.9-14/src/gfs/log.c,
line = 734
GFS: fsid=MILTON:data1.1:   time = 1107204803

Version-Release number of selected component (if applicable):
RPMs from cluster-i686-2005-01-28-1544.tar

How reproducible:
Yes.


Steps to Reproduce:
1. Mount a single GFS on each of three nodes (no traffic required).
2. On one node run 'gfs_tool withdraw /mnt/gfs1', then 'umount /mnt/gfs1'.
3. On a second node run 'gfs_tool withdraw /mnt/gfs1', then 'umount /mnt/gfs1'.
Actual results:
2nd node panics.

Expected results:
gfs_tool withdraw; umount /gfs/mntpoint doesn't kill the node.

Additional info:
Comment 1 Kiersten (Kerri) Anderson 2005-02-01 11:09:21 EST
Adding to blocker list for release
Comment 2 Ken Preslan 2005-02-09 19:05:24 EST
This should be fixed now.
Comment 3 Derek Anderson 2005-02-10 12:48:10 EST
Ran the same test on the build from 2005-02-10-1033 and got the same results 
as originally. 
Comment 4 michael conrad tadpol tilstra 2005-02-23 10:44:00 EST
As a side note, with lock_gulm you cannot remount a fs without first doing
rmmod lock_gulm.  (i.e., mount, withdraw, umount, mount will cause problems
with lock_gulm, whereas mount, withdraw, umount, rmmod, mount should work.)

I know this bug is with dlm, but sooner or later someone will try it with gulm.
Comment 5 michael conrad tadpol tilstra 2005-02-23 16:51:51 EST
comment #4 was made before I realized that this bug was about multiple nodes.


Definitely a gfs-ism; you get the same backtrace with dlm|gulm.
Comment 6 michael conrad tadpol tilstra 2005-02-24 12:43:48 EST
I was under the impression that once withdrawn, GFS would no longer try to write to the device, and
here it looks like it is trying to do exactly that.  I could just be missing something, though.  Going to read more code...
Comment 7 michael conrad tadpol tilstra 2005-02-24 14:26:41 EST
A: gfs_tool withdraw /mnt
B: gfs_tool withdraw /mnt
B: umount /mnt
 ==> oops.

You don't need to unmount the first node, just withdraw it.
Comment 8 michael conrad tadpol tilstra 2005-02-24 16:03:13 EST
Well, if you set the ROFS bit along with the SHUTDOWN bit, it kinda looks like things work.  But I don't
quite believe that it is really the answer.  More prodding required.
Comment 9 michael conrad tadpol tilstra 2005-02-28 14:59:24 EST
Comment #8 is definitely not the answer; it causes other oopses.


Also:
Three nodes, no I/O.  (Must have no I/O.)
A: reboot -fn
wait for others to replay A's journal.
B: gfs_tool withdraw /mnt
B: umount /mnt
  ==> oops.


So, withdraw *after* a node tries to replay some other node's journal, followed
by umount, causes an oops.

What it looks like is happening is that the other nodes give up the transaction
lock so one node can replay a journal.  In doing so, they close out their
journals.  Now, *any* activity other than withdraw causes the trans lock to be
reheld and the journal to be reopened, so all is good.  However, a withdraw does
not rehold the trans lock, but still tries to calculate log entries.
Since it didn't reopen the journal, the calculations are off, and it asserts
with a backtrace.

No idea how to fix it yet, but I think I've finally found what causes the problem.
Comment 10 michael conrad tadpol tilstra 2005-03-01 12:06:42 EST
Think I've got a fix.

If a withdraw is called before we've had a chance to relock the trans
lock, sd_log_head points to the wrong place, and a umount will
fail on asserts because of this.
Adding one puts sd_log_head at a value that passes the assert.  The
value may not be correct for on-disk, but we've withdrawn, so there is
no more disk I/O.
If we're not withdrawn, the next I/O will grab the trans lock, which
will fill sd_log_head with the correct value.

Comment 11 Derek Anderson 2005-03-08 13:03:43 EST
Verified in gfs_tool 6.1-0.pre19.
