Bug 699082 - GFS2 partition hangs after successful fence
Summary: GFS2 partition hangs after successful fence
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gfs2-utils
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Robert Peterson
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-04-22 23:58 UTC by Madison Kelly
Modified: 2011-04-28 21:39 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-28 21:39:13 UTC
Target Upstream Version:


Attachments
cluster.conf of the affected cluster. (2.63 KB, application/octet-stream)
2011-04-22 23:58 UTC, Madison Kelly

Description Madison Kelly 2011-04-22 23:58:13 UTC
Created attachment 494347
cluster.conf of the affected cluster.

Description of problem:

In a two-node cluster, killing one node causes the GFS2 partition to block until the lost node rejoins the cluster, despite a successful fence call.

Version-Release number of selected component (if applicable):

cman-2.0.115-68.el5
rgmanager-2.0.52-9.el5
gfs2-utils-0.1.62-28.el5

How reproducible:

100%

Steps to Reproduce:
1. Set up a two-node cluster (example cluster.conf attached).
2. Hang or power off one of the nodes (e.g., 'echo c > /proc/sysrq-trigger' or pull the power).
3. Run 'ls -lah' on a mounted GFS2 partition.
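
Step 3 can be checked without tying up an interactive shell. The following is a minimal sketch, not part of the original report: the helper name still_running_after and the 30-second deadline in the usage line are illustrative assumptions. It runs a command in the background and reports whether it is still running once the deadline passes.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and check whether
# it is still running after a deadline.
# Returns 0 if the command is still running (it looks blocked),
# 1 if it finished before the deadline.
still_running_after() {
    secs=$1
    shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    elapsed=0
    while [ "$elapsed" -lt "$secs" ]; do
        # kill -0 sends no signal; it only tests that the process exists.
        kill -0 "$pid" 2>/dev/null || return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    kill -0 "$pid" 2>/dev/null
}

# Example (the mount point is a placeholder):
#   still_running_after 30 ls -lah /path/to/gfs2/mount && echo "still blocked"
```

The helper deliberately does not try to kill the command. A process blocked in uninterruptible sleep (state D, as in the trace below) does not respond to signals until the wait ends.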
  
Actual results:

GFS2 partitions block.

Expected results:

GFS2 partitions become usable again once the fence succeeds.

Additional info:

Excerpt from /var/log/messages on the surviving node.

====
Apr 22 19:49:04 an-node01 fenced[5270]: fencing node "an-node02.alteeve.com"
Apr 22 19:49:17 an-node01 fenced[5270]: fence "an-node02.alteeve.com" success
Apr 22 19:49:17 an-node01 kernel: GFS2: fsid=an-cluster:xen_shared.1: jid=0: Trying to acquire journal lock...
Apr 22 19:49:18 an-node01 clurgmgrd[5632]: <notice> Marking service:an2_storage as stopped: Restricted domain unavailable 
Apr 22 19:49:19 an-node01 clurgmgrd[5632]: <notice> Taking over service vm:vm0001_c5_ws1 from down member an-node02.alteeve.com 
Apr 22 19:51:46 an-node01 kernel: INFO: task gfs2_recoverd:6624 blocked for more than 120 seconds.
Apr 22 19:51:46 an-node01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 19:51:46 an-node01 kernel: gfs2_recoverd D ffff8800b3501c90     0  6624     11          6627  6614 (L-TLB)
Apr 22 19:51:46 an-node01 kernel:  ffff8800b3501c30  0000000000000246  0000000000000000  ffff8800bc13c800 
Apr 22 19:51:46 an-node01 kernel:  000000000000000a  ffff8800b8b84040  ffff8800c2a960c0  00000000000092c9 
Apr 22 19:51:46 an-node01 kernel:  ffff8800b8b84228  0000000000000000 
Apr 22 19:51:46 an-node01 kernel: Call Trace:
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff888fa7b8>] :dlm:dlm_put_lockspace+0x10/0x1f
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff888f8e5f>] :dlm:dlm_lock+0x117/0x129
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff8899e556>] :lock_dlm:gdlm_ast+0x0/0x311
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff8899e2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:46 an-node01 kernel:  [<ffffffff88922f05>] :gfs2:just_schedule+0x9/0xe
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80263805>] __wait_on_bit+0x40/0x6e
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029de00>] kthread_bind+0x48/0x62
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8026389f>] out_of_line_wait_on_bit+0x6c/0x78
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029e060>] wake_bit_function+0x0/0x23
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff88922ef7>] :gfs2:gfs2_glock_wait+0x2b/0x30
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff88935bac>] :gfs2:gfs2_recover_journal+0xd6/0x849
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80262dcb>] thread_return+0x6c/0x113
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff88935ba4>] :gfs2:gfs2_recover_journal+0xce/0x849
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff889246cf>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8025dfd6>] del_timer_sync+0xc/0x16
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff802636a2>] schedule_timeout+0x92/0xad
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff88936348>] :gfs2:gfs2_recoverd+0x29/0x78
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8893631f>] :gfs2:gfs2_recoverd+0x0/0x78
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80233dc4>] kthread+0xfe/0x132
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80260b2c>] child_rip+0xa/0x12
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80233cc6>] kthread+0x0/0x132
Apr 22 19:51:47 an-node01 kernel:  [<ffffffff80260b22>] child_rip+0x0/0x12
====
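
The sequence of interest in the excerpt above (fence start, fence success, journal recovery, hung-task report) can be pulled out of a larger log with a filter along these lines. This is a rough sketch: the function name is hypothetical, and the patterns are derived only from the messages quoted in this report, so other versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: print fence, GFS2 journal, and hung-task lines
# from a syslog file (defaults to /var/log/messages).
fence_and_hang_events() {
    grep -E 'fenced\[[0-9]+\]: (fencing node|fence )|GFS2: .*jid=[0-9]+|blocked for more than [0-9]+ seconds' \
        "${1:-/var/log/messages}"
}

# Example:
#   fence_and_hang_events /var/log/messages
```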

Comment 1 Robert Peterson 2011-04-25 13:16:29 UTC
Since this is gfs2_recoverd and it's in a glock wait,
this may be a duplicate of bug #553803.  Can we get a glock
dump when this occurs to check it?  Also, can we get the
kernel version?
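
One possible way to gather the requested kernel version together with the package versions listed in the description is sketched below. The function name is an assumption made for illustration; uname -r and rpm -q are standard commands.

```shell
#!/bin/sh
# Hypothetical helper: print the running kernel release and, where rpm
# is available, the versions of the cluster packages named in this report.
collect_versions() {
    echo "kernel: $(uname -r)"
    if command -v rpm >/dev/null 2>&1; then
        # rpm -q reports "package ... is not installed" for missing packages.
        rpm -q cman rgmanager gfs2-utils
    fi
}

# Example:
#   collect_versions > ~/versions.out
```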

Comment 2 Madison Kelly 2011-04-25 13:51:19 UTC
Sure, I can do that. I need to know how to get a glock dump though. Is this an option in cluster.conf or elsewhere?

Comment 3 Robert Peterson 2011-04-25 13:58:59 UTC
To collect a glock dump:

(1) Make sure debugfs is mounted:
mount -t debugfs none /sys/kernel/debug
or add this line to /etc/fstab and mount -a:
debugfs       /sys/kernel/debug      debugfs  defaults        0 0

(2) Save off GFS2's glocks files from debugfs:
cat /sys/kernel/debug/gfs2/<file system ID>/glocks > ~/glocks.digimer.out
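
The two steps above could be combined into a small script so the dump can be taken quickly while the hang is in progress. This is a sketch under stated assumptions: the function names and the glocks.<file system ID>.out naming scheme are illustrative, and it saves the glocks file of every GFS2 file system found under debugfs instead of a single named one.

```shell
#!/bin/sh
# Hypothetical helpers for collecting GFS2 glock dumps.

# Mount debugfs at $1 (default /sys/kernel/debug) unless it is already
# mounted there.
ensure_debugfs() {
    mnt=${1:-/sys/kernel/debug}
    if ! grep -qs " $mnt debugfs " /proc/mounts; then
        mount -t debugfs none "$mnt"
    fi
}

# Copy <root>/gfs2/<file system ID>/glocks to
# <outdir>/glocks.<file system ID>.out for every file system found.
# Returns 0 if at least one dump was written, 1 otherwise.
dump_glocks() {
    root=$1
    outdir=$2
    status=1
    for f in "$root"/gfs2/*/glocks; do
        [ -r "$f" ] || continue
        fsid=$(basename "$(dirname "$f")")
        cat "$f" > "$outdir/glocks.$fsid.out" && status=0
    done
    return $status
}

# Example (run as root on the surviving node):
#   ensure_debugfs /sys/kernel/debug && dump_glocks /sys/kernel/debug ~
```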

Comment 4 Madison Kelly 2011-04-25 14:09:36 UTC
I'll do this tonight and post the glock dump and kernel version. Thanks for the detailed response.

Comment 5 Madison Kelly 2011-04-28 21:20:26 UTC
Sorry for the delay in getting back to you on this. I've been messing with the crash and I think it's outside of GFS2. I'm going to close this. If it turns out to actually be GFS2 related, I'll re-open it with the debug info.

Thanks!

Comment 6 Madison Kelly 2011-04-28 21:34:35 UTC
I don't seem to be able to close this. Could someone with access close this as NOTABUG? Thanks. :)
