Created attachment 494347 [details]
cluster.conf of the affected cluster.

Description of problem:
In a two-node cluster, killing one node causes the GFS2 partition to block until the lost node rejoins the cluster, despite a successful fence call.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5
rgmanager-2.0.52-9.el5
gfs2-utils-0.1.62-28.el5

How reproducible:
100%

Steps to Reproduce (a rough shell sketch of these steps follows the log excerpt below):
1. Set up a two-node cluster (example cluster.conf attached).
2. Hang or power off one of the nodes (e.g. 'echo c > /proc/sysrq-trigger' or pull the power).
3. Try to 'ls -lah' a mounted GFS2 partition.

Actual results:
GFS2 partitions block.

Expected results:
GFS2 partitions return to use once the fence succeeds.

Additional info:
Excerpt from /var/log/messages on the surviving node.

====
Apr 22 19:49:04 an-node01 fenced[5270]: fencing node "an-node02.alteeve.com"
Apr 22 19:49:17 an-node01 fenced[5270]: fence "an-node02.alteeve.com" success
Apr 22 19:49:17 an-node01 kernel: GFS2: fsid=an-cluster:xen_shared.1: jid=0: Trying to acquire journal lock...
Apr 22 19:49:18 an-node01 clurgmgrd[5632]: <notice> Marking service:an2_storage as stopped: Restricted domain unavailable
Apr 22 19:49:19 an-node01 clurgmgrd[5632]: <notice> Taking over service vm:vm0001_c5_ws1 from down member an-node02.alteeve.com
Apr 22 19:51:46 an-node01 kernel: INFO: task gfs2_recoverd:6624 blocked for more than 120 seconds.
Apr 22 19:51:46 an-node01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 19:51:46 an-node01 kernel: gfs2_recoverd D ffff8800b3501c90 0 6624 11 6627 6614 (L-TLB)
Apr 22 19:51:46 an-node01 kernel: ffff8800b3501c30 0000000000000246 0000000000000000 ffff8800bc13c800
Apr 22 19:51:46 an-node01 kernel: 000000000000000a ffff8800b8b84040 ffff8800c2a960c0 00000000000092c9
Apr 22 19:51:46 an-node01 kernel: ffff8800b8b84228 0000000000000000
Apr 22 19:51:46 an-node01 kernel: Call Trace:
Apr 22 19:51:46 an-node01 kernel: [<ffffffff888fa7b8>] :dlm:dlm_put_lockspace+0x10/0x1f
Apr 22 19:51:46 an-node01 kernel: [<ffffffff888f8e5f>] :dlm:dlm_lock+0x117/0x129
Apr 22 19:51:46 an-node01 kernel: [<ffffffff8899e556>] :lock_dlm:gdlm_ast+0x0/0x311
Apr 22 19:51:46 an-node01 kernel: [<ffffffff8899e2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
Apr 22 19:51:46 an-node01 kernel: [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:46 an-node01 kernel: [<ffffffff88922f05>] :gfs2:just_schedule+0x9/0xe
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80263805>] __wait_on_bit+0x40/0x6e
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de00>] kthread_bind+0x48/0x62
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8026389f>] out_of_line_wait_on_bit+0x6c/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029e060>] wake_bit_function+0x0/0x23
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88922ef7>] :gfs2:gfs2_glock_wait+0x2b/0x30
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88935bac>] :gfs2:gfs2_recover_journal+0xd6/0x849
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80262dcb>] thread_return+0x6c/0x113
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88935ba4>] :gfs2:gfs2_recover_journal+0xce/0x849
Apr 22 19:51:47 an-node01 kernel: [<ffffffff889246cf>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8025dfd6>] del_timer_sync+0xc/0x16
Apr 22 19:51:47 an-node01 kernel: [<ffffffff802636a2>] schedule_timeout+0x92/0xad
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88936348>] :gfs2:gfs2_recoverd+0x29/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8893631f>] :gfs2:gfs2_recoverd+0x0/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80233dc4>] kthread+0xfe/0x132
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80260b2c>] child_rip+0xa/0x12
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80233cc6>] kthread+0x0/0x132
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80260b22>] child_rip+0x0/0x12
====
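For reference, the reproduction can be driven from the surviving node with something like the following. This is only a rough sketch: the node name, root ssh access, and the /xen_shared mount point are assumptions for illustration, not taken from the attached cluster.conf.

====
#!/bin/sh
# Illustrative sketch only; node name and mount point are assumed.
# 1. Hang the peer node (the ssh never returns, so background it).
ssh root@an-node02.alteeve.com 'echo c > /proc/sysrq-trigger' &

# 2. Wait until fenced logs 'fence ... success' in /var/log/messages
#    on this node, then touch the GFS2 mount.
ls -lah /xen_shared   # expected to return once the fence succeeds; instead it blocks
====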
Since this is gfs2_recoverd and it's in a glock wait, this may be a duplicate of bug #553803. Can we get a glock dump when this occurs to check it? Also, can we get the kernel version?
Sure, I can do that. I need to know how to get a glock dump though. Is this an option in cluster.conf or elsewhere?
To collect a glock dump:

(1) Make sure debugfs is mounted:

mount -t debugfs none /sys/kernel/debug

or add this line to /etc/fstab and run mount -a:

debugfs /sys/kernel/debug debugfs defaults 0 0

(2) Save off GFS2's glocks files from debugfs:

cat /sys/kernel/debug/gfs2/<file system ID>/glocks > ~/glocks.digimer.out
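If it helps, the two steps above can be wrapped into a small script. This is only a minimal sketch, assuming debugfs lives at /sys/kernel/debug and that each directory under /sys/kernel/debug/gfs2/ corresponds to a mounted GFS2 filesystem (e.g. an-cluster:xen_shared from the log excerpt); the output file names are illustrative.

====
#!/bin/sh
# Illustrative sketch only.
# Mount debugfs if it is not already mounted.
grep -q ' /sys/kernel/debug ' /proc/mounts || \
    mount -t debugfs none /sys/kernel/debug

# Dump the glocks file of every GFS2 filesystem found in debugfs.
for fs in /sys/kernel/debug/gfs2/*; do
    [ -r "$fs/glocks" ] || continue
    cat "$fs/glocks" > ~/glocks.$(basename "$fs").out
done
====

Looping over the gfs2 directory just avoids having to know the exact filesystem ID string up front.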
I'll do this tonight and post the glock dump and kernel version. Thanks for the detailed response.
Sorry for the delay in getting back to you on this. I've been messing with the crash and I think it's outside of GFS2. I'm going to close this. If it turns out to actually be GFS2 related, I'll re-open it with the debug info. Thanks!
I don't seem to be able to close this. Could someone with access close this as NOTABUG? Thanks. :)