Bug 222335
| Summary: | stuck transaction lock causes journal recovery to hang in gfs |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | gfs-kmod |
| Version: | 5.0 |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | medium |
| Priority: | medium |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | Robert Peterson <rpeterso> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | ccaulfie, iannis, swhiteho, teigland |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-01-24 13:16:08 UTC |
Description (Corey Marthaler, 2007-01-11 18:53:39 UTC)
There was a bug (now fixed) in the group_tool -v display code that caused the FAIL_START_WAIT state to be reported for most of these groups when in fact the state is "none" for all but group 2:gfs1.

2:gfs1 is still in FAIL_START_WAIT:
- nodes 3 and 4 have started 2:gfs1 (have 1 under local_done)
- node 1 has _not_ started 2:gfs1 (has 0 under local_done)

node 1 has not started the group because gfs-kernel is blocked trying to acquire the transaction lock (1,2) in EX mode. The last gfs message related to fs gfs1 on node 1:

GFS: fsid=TAFT_CLUSTER:gfs1.2: jid=0: Acquiring the transaction lock...

taft-01:
Resource ffff810214a495c0 Name (len=24) " 1 2"
Master Copy
Granted Queue
0001012a PR Remote: 4 00010232
Conversion Queue
00010357 PR (EX)
Waiting Queue

taft-04:
Resource ffff8102062d69c0 Name (len=24) " 1 2"
Local Copy, Master is node 1
Granted Queue
00010232 PR Master: 0001012a
Conversion Queue
Waiting Queue

It's unknown why taft-04 is not releasing its PR lock on 1,2. From studying this bug on other occasions, it doesn't appear that taft-04 is using 1,2 for anything, so there's no obvious reason it couldn't release it.

Created attachment 145380 [details]
lock_dlm patch for debugging
Attached a patch that prints info about the transaction lock in lock_dlm.
The goal is to figure out why it's not being released when we expect it to be.
This is the patch I sent to Bob to try on Dec 8 when we hit this bug
on the marathon cluster. But it didn't happen again.
This looks similar to bug 218511; could be related.

Hit this again today.
2.6.18-8.el5
cman-2.0.60-1.el5

It would be good to get the smoke cluster set up to try to reproduce this one with the debugging patch above applied.

Fixing product name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

There's a possibility that this was a case of conversion deadlock resolution gone wrong, which was fixed in bug 234086. If that's the case there won't be a clear resolution of this bug; it will just have to expire over time. But if this problem still exists, we'll need to reproduce it with the debugging patch above.

Can't currently reproduce, moving off the release criteria.

I was looking for a gfs bug before, but I think this is a dlm bug. On taft-01 dlm recovery isn't finished either, so that would be what's blocking the gfs recovery. We need to enable dlm debugging to figure out where dlm recovery is at.

Comment 10 wasn't taking into account comment 1, where I discovered that the group_tool output is incorrect, so I believe the diagnosis in comment 1 still stands as correct.

Corey ran into another instance of this same bug today. I'll post the relevant bits of the log tomorrow, but I believe we end up no further than we were before. (Why did I get caught by the group_tool misrepresentation again? I thought I'd fixed that, or is the fixed version still not on the taft cluster?)
In yesterday's instance of this bug we have the same situation as before:

taft-01
Resource ffff810211377780 Name (len=24) " 1 2"
Master Copy
Granted Queue
00cd0001 PR Remote: 3 01610001
Conversion Queue
00160001 PR (EX)
Waiting Queue

taft-03
Resource ffff81020beeecc0 Name (len=24) " 1 2"
Local Copy, Master is node 1
Granted Queue
01610001 PR Master: 00cd0001
Conversion Queue
Waiting Queue

And gfs's view of 1,2 on taft-03 is:

Glock (1, 2)
gl_flags = 2
gl_count = 2
gl_state = 3
req_gh = no
req_bh = no
lvb_count = 0
object = no
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no

A couple of possible reasons why taft-03 isn't releasing its lock:
- the blocking callback is in progress in gfs for the trans lock, but stuck somewhere prior to the lock being released
- the blocking callback has been delivered to lock_dlm, but isn't being picked up by a lock_dlm thread (are they both busy?) for delivery back to gfs

A sysrq-t the next time this happens might give us some more clues.

Appear to have hit this bug, attaching the sysrq-t dumps.

Created attachment 174441 [details]
log from taft-01
Created attachment 174461 [details]
log from taft-02
Created attachment 174501 [details]
log from taft-03
Created attachment 174521 [details]
log from taft-04
[root@taft-01 tmp]# group_tool -v
type level name id state node id local_done
fence 0 default 00010004 none
[2 3 4 5]
dlm 1 clvmd 00020004 none
[2 3 4 5]
dlm 1 gfs1 00020002 none
[2 4 5]
dlm 1 gfs2 00040002 none
[2 4 5]
gfs 2 gfs1 00010002 none
[2 4 5]
gfs 2 gfs2 00030002 FAIL_START_WAIT 3 300030003 1
[2 4 5]

[root@taft-02 tmp]# group_tool -v
type level name id state node id local_done
fence 0 default 00010004 none
[2 3 4 5]
dlm 1 clvmd 00020004 none
[2 3 4 5]
dlm 1 gfs1 00020002 none
[2 4 5]
dlm 1 gfs2 00040002 none
[2 4 5]
gfs 2 gfs1 00010002 none
[2 4 5]
gfs 2 gfs2 00030002 FAIL_START_WAIT 3 300030003 0
[2 4 5]

[root@taft-03 tmp]# group_tool -v
type level name id state node id local_done
fence 0 default 00010004 none
[2 3 4 5]
dlm 1 clvmd 00020004 none
[2 3 4 5]

[root@taft-04 tmp]# group_tool -v
type level name id state node id local_done
fence 0 default 00010004 none
[2 3 4 5]
dlm 1 clvmd 00020004 none
[2 3 4 5]
dlm 1 gfs1 00020002 none
[2 4 5]
dlm 1 gfs2 00040002 none
[2 4 5]
gfs 2 gfs1 00010002 none
[2 4 5]
gfs 2 gfs2 00030002 FAIL_START_WAIT 3 300030003 1
[2 4 5]

Will close in August after a year of inactivity.

Corey reproduced this on grant-01,02,03.
grant-01
type level name id state node id local_done
fence 0 default 00010001 none
[1 2 3]
dlm 1 clvmd 00070001 none
[1 3]
dlm 1 A 00090001 none
[1 3]
dlm 1 B 00020002 none
[1 3]
gfs 2 A 00080001 FAIL_START_WAIT 2 200020003 0
[1 3]
gfs 2 B 00010002 none
[1 3]
grant-03
type level name id state node id local_done
fence 0 default 00010001 none
[1 2 3]
dlm 1 clvmd 00070001 none
[1 3]
dlm 1 A 00090001 none
[1 3]
dlm 1 B 00020002 none
[1 3]
gfs 2 A 00080001 FAIL_START_WAIT 2 200020003 1
[1 3]
gfs 2 B 00010002 none
[1 3]
So gfs recovery is stuck for fs A on grant-01 (local_done is 0). This is
confirmed by GFS recovery messages:
grant-01
GFS: fsid=GRANT:A.0: jid=1: Trying to acquire journal lock...
GFS: fsid=GRANT:B.0: jid=2: Trying to acquire journal lock...
GFS: fsid=GRANT:A.0: jid=1: Looking at journal...
GFS: fsid=GRANT:B.0: jid=2: Looking at journal...
GFS: fsid=GRANT:A.0: jid=1: Acquiring the transaction lock...
GFS: fsid=GRANT:B.0: jid=2: Acquiring the transaction lock...
GFS: fsid=GRANT:B.0: jid=2: Replaying journal...
GFS: fsid=GRANT:B.0: jid=2: Replayed 1 of 3 blocks
GFS: fsid=GRANT:B.0: jid=2: replays = 1, skips = 1, sames = 1
GFS: fsid=GRANT:B.0: jid=2: Journal replayed in 1s
GFS: fsid=GRANT:B.0: jid=2: Done
grant-03
GFS: fsid=GRANT:A.2: jid=1: Trying to acquire journal lock...
GFS: fsid=GRANT:B.1: jid=2: Trying to acquire journal lock...
GFS: fsid=GRANT:B.1: jid=2: Busy
GFS: fsid=GRANT:A.2: jid=1: Busy
The last message for fs A on grant-01 is "Acquiring the transaction lock".
And confirmed by ps ax -o pid,stat,cmd,wchan which shows gfs_recoverd on
grant-01:
11325 D< [gfs_recoverd] glock_wait_internal
We know that the first lock gfs_controld tries to acquire is the transaction
lock. Looking in the dlm and gfs lock dumps confirms that gfs_recoverd for
fs A on grant-01 is waiting for the transaction lock (1,2):
grant-01
Resource ffff8101154fd680 Name (len=24) " 1 2"
Master Copy
Granted Queue
00d30001 PR Remote: 3 00ce0001
Conversion Queue
032c0001 PR (EX)
Waiting Queue
grant-03
Resource ffff81021c530a80 Name (len=24) " 1 2"
Local Copy, Master is node 1
Granted Queue
00ce0001 PR Master: 00d30001
Conversion Queue
Waiting Queue
grant-01
Glock (1, 2)
gl_flags = 1 2
gl_count = 4
gl_state = 3
req_gh = yes
req_bh = yes
lvb_count = 0
object = no
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no
Request
owner = 11325
gh_state = 1
gh_flags = 2 4 5 10 12
error = 0
gh_iflags = 1
Waiter3
owner = 11325
gh_state = 1
gh_flags = 2 4 5 10 12
error = 0
gh_iflags = 1
grant-03
Glock (1, 2)
gl_flags = 1 2
gl_count = 4
gl_state = 3
req_gh = yes
req_bh = no
lvb_count = 0
object = no
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no
Request
owner = -1
gh_state = 0
gh_flags = 0
error = 0
gh_iflags = 2 4 5
Waiter2
owner = -1
gh_state = 0
gh_flags = 0
error = 0
gh_iflags = 2 4 5
So the reason gfs_recoverd for fs A on grant-01 can't acquire the
transaction lock in EX is that gfs on grant-03 is still holding
it in PR.
What we didn't know before is whether or not grant-03 had received a
blocking callback for it, and if so why it wasn't releasing it (the
transaction lock). Now, we have the following backtraces which show
that grant-03 has gotten the callback and is trying to release it.
gfs's lock_dlm1 thread handles blocking callbacks from the dlm, and
it's in the process of dropping the transaction lock, but for some
reason is stuck trying to flush the log, which is in completely
unfamiliar territory for me.
grant-03
lock_dlm1 D ffff810123849980 0 11291 87 11292 11219
(L-TLB)
ffff810110c9bd70 0000000000000046 ffff810110c9be2c 0000000280075ab9
ffff810110c9be20 000000000000000a ffff81021d0d0860 ffff8101239cc0c0
0000008181757811 00000000000016d8 ffff81021d0d0a48 0000000300000000
Call Trace:
[<ffffffff800646ca>] __down_write_nested+0x7a/0x92
[<ffffffff886197c5>] :gfs:log_flush_internal+0x16/0x26f
[<ffffffff88602793>] :gfs:gfs_sync_meta+0x9/0x41
[<ffffffff8860eac9>] :gfs:trans_go_drop_th+0x22/0x35
[<ffffffff8860c480>] :gfs:run_queue+0x12e/0x35d
[<ffffffff8860db32>] :gfs:blocking_cb+0x44/0x91
[<ffffffff8861824c>] :gfs:gdlm_thread+0x609/0x668
[<ffffffff8009ea08>] autoremove_wake_function+0x0/0x2e
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff886182b2>] :gfs:gdlm_thread1+0x0/0xa
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800324f3>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800323f5>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11
gfs_logd D ffff810123841400 0 11310 87 11311 11309
(L-TLB)
ffff810110c75de0 0000000000000046 ffff81011da78b80 ffff81011da78b80
0000000010000042 000000000000000a ffff81021ffa67e0 ffff810123925100
0000005ea977e062 00000000000004c9 ffff81021ffa69c8 000000010000000a
Call Trace:
[<ffffffff800646ca>] __down_write_nested+0x7a/0x92
[<ffffffff88619029>] :gfs:gfs_ail_empty+0x27/0x11a
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff8860239d>] :gfs:gfs_logd+0x18/0xc2
[<ffffffff88602385>] :gfs:gfs_logd+0x0/0xc2
[<ffffffff800324f3>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800323f5>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11
gfs_quotad D ffff810123841400 0 11311 87 11312 11310
(L-TLB)
ffff8101111ffe30 0000000000000046 ffff81011ced8c00 ffff81011ced8c00
0000000010000042 000000000000000a ffff81021d2c2040 ffff810123925100
0000006c5f559016 0000000000000a08 ffff81021d2c2228 000000010000000a
Call Trace:
[<ffffffff800646ca>] __down_write_nested+0x7a/0x92
[<ffffffff88626630>] :gfs:gfs_quota_sync+0xce/0x1c4
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff88602534>] :gfs:gfs_quotad+0xed/0x16d
[<ffffffff88602447>] :gfs:gfs_quotad+0x0/0x16d
[<ffffffff800324f3>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800323f5>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11
gfs_inoded D ffff81021d2c2040 0 11312 87 11327 11311
(L-TLB)
ffff810110c17e40 0000000000000046 ffff81021d5e7040 ffff81021d5e7078
ffff81011ced8c00 000000000000000a ffff81021d5e7040 ffff81021d2c2040
00000061ea509c8a 00000000000008f4 ffff81021d5e7228 0000000136a54457
Call Trace:
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800646ca>] __down_write_nested+0x7a/0x92
[<ffffffff8862ce87>] :gfs:unlinked_find+0x54/0xb3
[<ffffffff886021e8>] :gfs:gfs_inoded+0x0/0x44
[<ffffffff8862ceff>] :gfs:gfs_unlinked_dealloc+0x19/0xb3
[<ffffffff886021e8>] :gfs:gfs_inoded+0x0/0x44
[<ffffffff886021f6>] :gfs:gfs_inoded+0xe/0x44
[<ffffffff800324f3>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009e7f0>] keventd_create_kthread+0x0/0xc4
[<ffffffff800323f5>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11
d_doio D ffff81000100c980 0 11362 1 11337
(NOTLB)
ffff810110f3ba78 0000000000000086 00000000000003e8 ffffffff880f43f2
0000000000000000 0000000000000009 ffff81011c7df820 ffff810123960040
0000005e70a722b3 000000000000fa87 ffff81011c7dfa08 0000000223ae6e00
Call Trace:
[<ffffffff880f43f2>] :dm_mod:__map_bio+0x48/0x119
[<ffffffff8006dd27>] do_gettimeofday+0x40/0x8f
[<ffffffff8001517e>] sync_buffer+0x0/0x3f
[<ffffffff800638a1>] io_schedule+0x3f/0x67
[<ffffffff800151b9>] sync_buffer+0x3b/0x3f
[<ffffffff80063acd>] __wait_on_bit+0x40/0x6e
[<ffffffff8001517e>] sync_buffer+0x0/0x3f
[<ffffffff80063b67>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff8009ea36>] wake_bit_function+0x0/0x23
[<ffffffff88603ef6>] :gfs:gfs_logbh_wait+0x26/0x70
[<ffffffff88618ea2>] :gfs:disk_commit+0x2eb/0x44b
[<ffffffff88619939>] :gfs:log_flush_internal+0x18a/0x26f
[<ffffffff8861f7d7>] :gfs:do_write_buf+0x52b/0x67e
[<ffffffff8861ef04>] :gfs:walk_vm+0x10e/0x311
[<ffffffff8861f2ac>] :gfs:do_write_buf+0x0/0x67e
[<ffffffff8861f1b3>] :gfs:__gfs_write+0xac/0xc6
[<ffffffff800dc9bb>] do_readv_writev+0x198/0x295
[<ffffffff8861f1f6>] :gfs:gfs_write+0x0/0x8
[<ffffffff8861fd2a>] :gfs:gfs_llseek+0x75/0x84
[<ffffffff800b54f0>] audit_syscall_entry+0x16e/0x1a1
[<ffffffff800dcb42>] sys_writev+0x45/0x93
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
Typo in comment 20: "We know that the first lock gfs_controld tries to acquire" should be "We know that the first lock gfs_recoverd tries to acquire".

Created attachment 323268 [details]
/var/log/messages from grant-03
This is the entire /var/log/messages file from grant-03 which
shows all backtraces from the sysrq-t.
The reproduction of this bug in comments 13-18 includes sysrq backtraces, but they do not show the lock_dlm1 thread trying to drop the transaction lock like the latest backtrace does. I don't have an explanation for that.

The lock_dlm1 thread is waiting for the sd_log_lock rw_sem, which prevents simultaneous calls to log_flush_internal. The rw_sem is held by d_doio, which seems to be in an I/O wait, waiting for its log flush disk-commit to complete. So why is its I/O blocked?

What has happened to this bug? It seems to have a lot of work which just stops with no conclusion. Is this still required? Can we close it?

It looks to me like the issue is that waiting on I/O has not returned. Most likely this is due to waiting for some I/O that was not submitted. If this is recreatable, it ought to be possible to see with blktrace what the last bits of I/O were before it stops.
There are also some other things I spotted too:
In gfs_logbh_wait():
if (!buffer_uptodate(bh) || buffer_dirty(bh)) {
shouldn't be testing for buffer_dirty(), but it might reasonably test for buffer_error(), which makes more sense.
In gfs_logbh_init:
((unsigned long)data) & (PAGE_SIZE - 1) ought to be page_offset(data)
I can't see why there should be an issue with waiting for I/O though, so it might be more subtle.
I have recently created a patch to fix something very similar in gfs2. I wondered why we'd never seen it before, and this bug makes me think that perhaps we have.... anyway, the bz in question is #483541.

IMHO, the biggest part of the problem here is that the transaction glock is being bounced around when, in fact, it should remain held. As part of RHEL4 bug #455696 I prevented that from happening with this code:

@@ -2667,6 +2666,11 @@ clear_glock(struct gfs_glock *gl, unsigned int *unused)
 	struct gfs_sbd *sdp = gl->gl_sbd;
 	struct gfs_gl_hash_bucket *bucket = gl->gl_bucket;

+	/* If this isn't shutdown, keep the transaction glock around. */
+	if (sdp->sd_glockd_num && gl == sdp->sd_trans_gl) {
+		glock_put(gl); /* see examine_bucket() */
+		return;
+	}
 	spin_lock(&sdp->sd_reclaim_lock);
 	if (!list_empty(&gl->gl_reclaim)) {
 		list_del_init(&gl->gl_reclaim);

As of this writing, this patch was never shipped in a RHEL5 release but exists as a patch for rhel5 crosswrite bug #491369. So my theory is that this bug is fixed by bug #491369 and the problem should go away. So I'm tempted to just close this as a duplicate.

It isn't clear whether we still need this bug or not; either way I'm clearing the devel ack until a solution is clearer.

Is it reasonable at this stage to close this as a dup of #491369? The code from comment #29 appears to be already in RHEL5, so is there anything more to do here, I wonder?