Created attachment 494347 [details]
cluster.conf of the affected cluster.

Description of problem:
In a two-node cluster, killing one node causes the GFS2 partition to block until the lost node rejoins the cluster, despite a successful fence call.

Version-Release number of selected component (if applicable):
cman-2.0.115-68.el5
rgmanager-2.0.52-9.el5
gfs2-utils-0.1.62-28.el5

How reproducible:
100%

Steps to Reproduce (a rough shell sketch of these steps follows the log excerpt below):
1. Set up a two-node cluster (example cluster.conf attached).
2. Hang or power off one of the nodes (e.g. 'echo c > /proc/sysrq-trigger' or pull the power).
3. Try to 'ls -lah' a mounted GFS2 partition.

Actual results:
GFS2 partitions block.

Expected results:
GFS2 partitions return to use once the fence succeeds.

Additional info:
Excerpt from /var/log/messages on the surviving node.

====
Apr 22 19:49:04 an-node01 fenced[5270]: fencing node "an-node02.alteeve.com"
Apr 22 19:49:17 an-node01 fenced[5270]: fence "an-node02.alteeve.com" success
Apr 22 19:49:17 an-node01 kernel: GFS2: fsid=an-cluster:xen_shared.1: jid=0: Trying to acquire journal lock...
Apr 22 19:49:18 an-node01 clurgmgrd[5632]: <notice> Marking service:an2_storage as stopped: Restricted domain unavailable
Apr 22 19:49:19 an-node01 clurgmgrd[5632]: <notice> Taking over service vm:vm0001_c5_ws1 from down member an-node02.alteeve.com
Apr 22 19:51:46 an-node01 kernel: INFO: task gfs2_recoverd:6624 blocked for more than 120 seconds.
Apr 22 19:51:46 an-node01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 19:51:46 an-node01 kernel: gfs2_recoverd D ffff8800b3501c90 0 6624 11 6627 6614 (L-TLB)
Apr 22 19:51:46 an-node01 kernel: ffff8800b3501c30 0000000000000246 0000000000000000 ffff8800bc13c800
Apr 22 19:51:46 an-node01 kernel: 000000000000000a ffff8800b8b84040 ffff8800c2a960c0 00000000000092c9
Apr 22 19:51:46 an-node01 kernel: ffff8800b8b84228 0000000000000000
Apr 22 19:51:46 an-node01 kernel: Call Trace:
Apr 22 19:51:46 an-node01 kernel: [<ffffffff888fa7b8>] :dlm:dlm_put_lockspace+0x10/0x1f
Apr 22 19:51:46 an-node01 kernel: [<ffffffff888f8e5f>] :dlm:dlm_lock+0x117/0x129
Apr 22 19:51:46 an-node01 kernel: [<ffffffff8899e556>] :lock_dlm:gdlm_ast+0x0/0x311
Apr 22 19:51:46 an-node01 kernel: [<ffffffff8899e2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
Apr 22 19:51:46 an-node01 kernel: [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:46 an-node01 kernel: [<ffffffff88922f05>] :gfs2:just_schedule+0x9/0xe
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80263805>] __wait_on_bit+0x40/0x6e
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de00>] kthread_bind+0x48/0x62
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88922efc>] :gfs2:just_schedule+0x0/0xe
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8026389f>] out_of_line_wait_on_bit+0x6c/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029e060>] wake_bit_function+0x0/0x23
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88922ef7>] :gfs2:gfs2_glock_wait+0x2b/0x30
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88935bac>] :gfs2:gfs2_recover_journal+0xd6/0x849
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80262dcb>] thread_return+0x6c/0x113
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88935ba4>] :gfs2:gfs2_recover_journal+0xce/0x849
Apr 22 19:51:47 an-node01 kernel: [<ffffffff889246cf>] :gfs2:gfs2_glock_nq_num+0x3b/0x68
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8025dfd6>] del_timer_sync+0xc/0x16
Apr 22 19:51:47 an-node01 kernel: [<ffffffff802636a2>] schedule_timeout+0x92/0xad
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff88936348>] :gfs2:gfs2_recoverd+0x29/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8893631f>] :gfs2:gfs2_recoverd+0x0/0x78
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80233dc4>] kthread+0xfe/0x132
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80260b2c>] child_rip+0xa/0x12
Apr 22 19:51:47 an-node01 kernel: [<ffffffff8029de1a>] keventd_create_kthread+0x0/0xc4
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80233cc6>] kthread+0x0/0x132
Apr 22 19:51:47 an-node01 kernel: [<ffffffff80260b22>] child_rip+0x0/0x12
====
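For reference, the reproduction can be driven from the surviving node with something like the following. This is only a rough sketch: the node name, root ssh access, and the /xen_shared mount point are assumptions for illustration, not taken from the attached cluster.conf.

====
#!/bin/sh
# Illustrative sketch only; node name and mount point are assumed.
# 1. Hang the peer node (the ssh never returns, so background it).
ssh root@an-node02.alteeve.com 'echo c > /proc/sysrq-trigger' &

# 2. Wait until fenced logs 'fence ... success' in /var/log/messages
#    on this node, then touch the GFS2 mount.
ls -lah /xen_shared   # expected to return once the fence succeeds; instead it blocks
====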
Since this is gfs2_recoverd and it's in a glock wait, this may be a duplicate of bug #553803. Can we get a glock dump when this occurs to check it? Also, can we get the kernel version?
Sure, I can do that. I need to know how to get a glock dump though. Is this an option in cluster.conf or elsewhere?
To collect a glock dump:

(1) Make sure debugfs is mounted:

mount -t debugfs none /sys/kernel/debug

or add this line to /etc/fstab and run mount -a:

debugfs /sys/kernel/debug debugfs defaults 0 0

(2) Save off GFS2's glocks files from debugfs:

cat /sys/kernel/debug/gfs2/<file system ID>/glocks > ~/glocks.digimer.out
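If it helps, the two steps above can be wrapped into a small script. This is only a minimal sketch, assuming debugfs lives at /sys/kernel/debug and that each directory under /sys/kernel/debug/gfs2/ corresponds to a mounted GFS2 filesystem (e.g. an-cluster:xen_shared from the log excerpt); the output file names are illustrative.

====
#!/bin/sh
# Illustrative sketch only.
# Mount debugfs if it is not already mounted.
grep -q ' /sys/kernel/debug ' /proc/mounts || \
    mount -t debugfs none /sys/kernel/debug

# Dump the glocks file of every GFS2 filesystem found in debugfs.
for fs in /sys/kernel/debug/gfs2/*; do
    [ -r "$fs/glocks" ] || continue
    cat "$fs/glocks" > ~/glocks.$(basename "$fs").out
done
====

Looping over the gfs2 directory just avoids having to know the exact filesystem ID string up front.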
I'll do this tonight and post the glock dump and kernel version. Thanks for the detailed response.
Sorry for the delay in getting back to you on this. I've been messing with the crash and I think it's outside of GFS2. I'm going to close this. If it turns out to actually be GFS2 related, I'll re-open it with the debug info. Thanks!
I don't seem to be able to close this. Could someone with access close this as NOTABUG? Thanks. :)