Bug 146711 - gfs_tool withdraw: 2nd node withdrawn and umounted panics.
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: michael conrad tadpol tilstra
QA Contact: GFS Bugs
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 144795
 
Reported: 2005-01-31 22:18 UTC by Derek Anderson
Modified: 2010-01-12 03:02 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-03-08 18:03:43 UTC



Description Derek Anderson 2005-01-31 22:18:19 UTC
Description of problem:
I have a 3 node cluster running ccsd/cman/dlm/clvmd/gfs.  Mount a
single GFS on each of the three nodes.  Traffic does not need to be
running.  On nodeC run 'gfs_tool withdraw /mnt/gfs1'.  Then 'umount
/mnt/gfs1'.  The node is happy and the GFS can be remounted.

On nodeB run 'gfs_tool withdraw /mnt/gfs1' then 'mount' then 'umount
/mnt/gfs1' and get the following panic:

GFS: fsid=MILTON:data1.1: withdrawing from cluster at user's request
GFS: fsid=MILTON:data1.1: about to withdraw from the cluster
GFS: fsid=MILTON:data1.1: waiting for outstanding I/O
GFS: fsid=MILTON:data1.1: telling LM to withdraw
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: withdrawing
from cluster at user's request
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: about to
withdraw from the cluster
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: waiting for
outstanding I/O
Jan 31 15:53:02 link-12 kernel: GFS: fsid=MILTON:data1.1: telling LM
to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=MILTON:data1.1: withdrawn
Jan 31 15:53:03 link-12 kernel: lock_dlm: withdraw abandoned memory
Jan 31 15:53:03 link-12 kernel: GFS: fsid=MILTON:data1.1: withdrawn
 [<e0483a32>] gfs_assert_i+0x48/0x69 [gfs]
 [<e04686b7>] disk_commit+0xdd/0x260 [gfs]
 [<e04695a2>] gfs_log_dump+0x3c5/0x550 [gfs]
 [<e048132f>] gfs_make_fs_ro+0x44/0x80 [gfs]
 [<e04772e3>] gfs_put_super+0x263/0x331 [gfs]
 [<c0168af1>] generic_shutdown_super+0x119/0x2eb
 [<e0474d6e>] gfs_kill_sb+0x1f/0x43 [gfs]
 [<c01687c5>] deactivate_super+0xc5/0xda
 [<c01833f0>] sys_umount+0x65/0x6c
 [<c01549dc>] unmap_vma_list+0xe/0x17
 [<c0154d8b>] do_munmap+0x1c8/0x1d2
 [<c0183402>] sys_oldumount+0xb/0xe
 [<c0301bfb>] syscall_call+0x7/0xb
Kernel panic - not syncing: GFS: fsid=MILTON:data1.1: assertion
"gfs_log_is_header(sdp, tr->tr_first_head)" failed
GFS: fsid=MILTON:data1.1:   function = disk_commit
GFS: fsid=MILTON:data1.1:   file =
/usr/src/build/512195-i686/BUILD/gfs-kernel-2.6.9-14/src/gfs/log.c,
line = 734
GFS: fsid=MILTON:data1.1:   time = 1107204803

Version-Release number of selected component (if applicable):
RPMs from cluster-i686-2005-01-28-1544.tar

How reproducible:
Always.


Steps to Reproduce:
1. On a 3-node cluster running ccsd/cman/dlm/clvmd/gfs, mount a single GFS on all three nodes (no traffic needed).
2. On nodeC run 'gfs_tool withdraw /mnt/gfs1' then 'umount /mnt/gfs1'.  This succeeds and the GFS can be remounted.
3. On nodeB run 'gfs_tool withdraw /mnt/gfs1' then 'umount /mnt/gfs1'.  The node panics.
  
Actual results:
2nd node panics.

Expected results:
gfs_tool withdraw; umount /gfs/mntpoint doesn't kill the node.

Additional info:

Comment 1 Kiersten (Kerri) Anderson 2005-02-01 16:09:21 UTC
Adding to blocker list for release

Comment 2 Ken Preslan 2005-02-10 00:05:24 UTC
This should be fixed now.


Comment 3 Derek Anderson 2005-02-10 17:48:10 UTC
Ran the same test on the build from 2005-02-10-1033 and got the same results 
as originally. 

Comment 4 michael conrad tadpol tilstra 2005-02-23 15:44:00 UTC
As a side note, with lock_gulm you cannot remount a fs without first doing rmmod
lock_gulm.  (I.e., mount, withdraw, umount, mount will cause problems with
lock_gulm, whereas mount, withdraw, umount, rmmod, mount should work.)

I know this bug is with dlm, but sooner or later someone will try it with gulm.

Comment 5 michael conrad tadpol tilstra 2005-02-23 21:51:51 UTC
comment #4 was made before I realized that this bug was about multiple nodes.


Definitely a gfs-ism; you get the same backtrace with dlm|gulm.


Comment 6 michael conrad tadpol tilstra 2005-02-24 17:43:48 UTC
I was under the impression that once withdrawn, gfs would no longer try to write to the device, and
here it looks like it is trying to do exactly that.  I could just be missing something, though.  Going to read more code...

Comment 7 michael conrad tadpol tilstra 2005-02-24 19:26:41 UTC
A: gfs_tool withdraw /mnt
B: gfs_tool withdraw /mnt
B: umount /mnt
 ==> oops.

You don't need to unmount the first node, just withdraw it.

Comment 8 michael conrad tadpol tilstra 2005-02-24 21:03:13 UTC
Well, if you set the ROFS bit along with the SHUTDOWN bit, it kind of looks like things work.  But I don't
quite believe that it is really the answer.  More prodding required.

Comment 9 michael conrad tadpol tilstra 2005-02-28 19:59:24 UTC
comment #8 is definitely not the answer.  It causes other oopses.


Also:
Three nodes, no io. (must have no io.)
A: reboot -fn
wait for others to replay A's journal.
B: gfs_tool withdraw /mnt
B: umount /mnt
  ==> oops.


So: a withdraw *after* a node has given up the transaction lock for another
node's journal replay, followed by umount, causes the oops.

What looks to be happening is that the other nodes give up the transaction
lock so one node can replay a journal.  In doing so, they close out their
journals.  Normally, *any* activity other than withdraw causes the trans lock
to be reheld and the journal to be reopened, so all is good.  However, a
withdraw does not rehold the trans lock, yet still tries to calculate log
entries.  Since it didn't reopen the journal, the calculations are off, and we
assert with the backtrace.

No idea how to fix it yet, but I think I've finally found what causes the problem.

Comment 10 michael conrad tadpol tilstra 2005-03-01 17:06:42 UTC
Think I've got a fix.

If a withdraw is called before we've had a chance to relock the trans
lock, sd_log_head points to the wrong place, and a umount will
fail on asserts because of this.
Adding one puts sd_log_head at a value that passes the assert.  The
value may not be correct for on-disk, but we've withdrawn, so there is
no more disk I/O.
If we're not withdrawn, the next I/O will grab the trans lock, which
will fill sd_log_head with the correct value.



Comment 11 Derek Anderson 2005-03-08 18:03:43 UTC
Verified in gfs_tool 6.1-0.pre19.

