Bug 495514
Summary: gfs deadlocks during 4.7.z testing
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: gfs
Assignee: Robert Peterson <rpeterso>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: cfeist, edamato, swhiteho
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2009-04-17 16:16:57 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Description
Corey Marthaler
2009-04-13 15:53:27 UTC

Created attachment 339334 [details]
kern dump from z1

Created attachment 339335 [details]
kern dump from z2

Created attachment 339336 [details]
kern dump from z3
Corey, did this test pass on GFS-kernel-2.6.9-80.9.el4_7.13? Also, can I get a gfs lock dump from the hang?

Created attachment 339391 [details]
gfs_tool lockdump from z1

Bob, I grabbed gfs_tool lockdump output for z1 and z2.

Created attachment 339392 [details]
gfs_tool lockdump from z2
On z1:

Glock (3, 3656288)
  gl_flags = 1
  gl_count = 4
  gl_state = 0
  req_gh = yes
  req_bh = yes
  lvb_count = 1
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = 31219
    gh_state = 3
    gh_flags = 5 6 8
    error = 0
    gh_iflags = 1
  Waiter3
    owner = 31219
    gh_state = 3
    gh_flags = 5 6 8
    error = 0
    gh_iflags = 1

This seems to be waiting for a shared lock, but on z2 nothing seems to be holding this lock. Are there lock dumps from z3? DLM lock dumps would also be useful, as they might have more info about what's happening on other nodes, if you can find the master for that lock.

This reminds me just why the gfs2 lock dump format is so much better. I have no idea what the two lock requests are actually doing, since we don't have the address of the function that initialised them, unlike with gfs2.

The only difference between this 4.7.z and the previous one is the fix for bug #455696 and its z-stream counterpart. This will likely be closed as a DUP and that one set to FAILS_QA. I need to adjust that fix.

Created attachment 339692 [details]
logs from hung tankmorph cluster
I hit this during 4.8 testing too; attached are logs from all nodes with SysRq-T output, gfs_tool lockdumps, and DLM lock dumps.
Afaict, the problem seems to be with the new sd_log_flush_lock I introduced with the patch for bug #455696. There's a circular lock problem that involves function gfs_log_dump. If I'm not mistaken, the hanging call sequence looks something like this:

gfs_log_flush_glock
  down(&sdp->sd_log_flush_lock)
  -> gfs_log_flush_internal
     -> gfs_log_dump
        -> gfs_log_reserve
           -> gfs_log_flush
              down(&sdp->sd_log_flush_lock);  /* already held: deadlock */

That brings up the question of how the code did this properly before my patch. I need to investigate that and hopefully come up with a solution. Since I can recreate the hang now, I should at least be able to tell whether my solution works.

Created attachment 339708 [details]
Proposed patch
With this patch, I'm no longer able to recreate the failure on my system. I'd like to get Corey and possibly Nate to test this in patch form before I push it to the git repositories.
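The kernel call chain above can't be run directly here, but the failure mode it describes (a non-recursive lock re-acquired on the same call path) can be sketched in a user-space analogue. This is an illustration only, assuming nothing beyond the comment's description: the function names mirror the GFS ones, and `threading.Lock`, like a semaphore taken with `down()`, is not reentrant. A non-blocking inner acquire makes the self-deadlock detectable instead of hanging:

```python
import threading

# Stand-in for sdp->sd_log_flush_lock: a non-reentrant lock,
# like the semaphore introduced by the patch for bug #455696.
sd_log_flush_lock = threading.Lock()

def gfs_log_flush():
    # Inner path re-takes the same lock. acquire(blocking=False)
    # returns False because the lock is already held by this very
    # call path -- in the kernel, down() would sleep forever here.
    if not sd_log_flush_lock.acquire(blocking=False):
        return "DEADLOCK: sd_log_flush_lock already held on this path"
    try:
        return "flushed"
    finally:
        sd_log_flush_lock.release()

def gfs_log_flush_glock():
    # Outer entry point takes the lock, then descends through
    # gfs_log_flush_internal -> gfs_log_dump -> gfs_log_reserve
    # (elided here) back into gfs_log_flush.
    with sd_log_flush_lock:
        return gfs_log_flush()

print(gfs_log_flush_glock())
# -> DEADLOCK: sd_log_flush_lock already held on this path
```

The fix alluded to in the proposed patch would have to break this cycle, e.g. by not re-entering the flush path while the lock is held; the sketch only demonstrates why the unpatched chain hangs.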
Waiting to hear the outcome of the patch. Setting NEEDINFO.

The scratch build with the patch does pass our GFS regression tests.

I pushed the patch to the RHEL4, RHEL47 and RHEL48 branches of the cluster git tree. Now it's a matter of rebuilding those branches. I'm not sure of the best way to handle the bugzilla paperwork for this problem. I already have Modified bugzillas for all three releases that will need to be respun. I'm tempted to just close this as a duplicate of one of those; otherwise I'll have to go through the pain of requesting bugzillas and ack flags for all three releases again. I guess I need to talk to Mr. Feist about it.

This bug was a problem with the patch for bug #455696 and its z-stream counterparts, so I'm closing this as NOTABUG and I'll let Mr. Feist rebuild the appropriate z-stream rpms using those bugs.