Bug 459820
Summary: | gfs mount attempt deadlocks after recovery | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal>
Component: | gfs-kmod | Assignee: | Robert Peterson <rpeterso>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Priority: | medium
Version: | 5.3 | CC: | djansa, edamato, rpeterso, syeghiay, teigland
Target Milestone: | rc | Target Release: | ---
Hardware: | All | OS: | Linux
Doc Type: | Bug Fix | Regression: | ---
Last Closed: | 2009-03-05 21:30:29 UTC | |
Description Corey Marthaler 2008-08-22 17:33:10 UTC
Created attachment 314832 [details]
log from taft-01
Created attachment 314833 [details]
log from taft-02
Created attachment 314834 [details]
log from taft-03
Created attachment 314835 [details]
log from taft-04
Reproduced this again while running w/ cmirrors. This shouldn't block beta, but should probably be fixed for 5.3 rc. I'll once again attach the kern dumps.

================================================================================
Senario iteration 8.3 started at Fri Aug 22 18:14:21 CDT 2008
Sleeping 2 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum minus one

Those picked to face the revolver... taft-04
Feeling lucky taft-04? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds
Verify that taft-04 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
<ignore name="taft-04_0" pid="17464" time="Fri Aug 22 18:18:29 2008" type="cmd" duration="871" ec="127" />
<ignore name="taft-04_1" pid="17468" time="Fri Aug 22 18:18:29 2008" type="cmd" duration="871" ec="127" />
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
All killed nodes are now qarshable
Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
checking fence recovery (state of each service)...
checking dlm recovery (state of each service)...
checking gfs recovery (state of each service)...
checking gfs2 recovery (state of each service)...
checking fence recovery (node membership of each service)...
checking dlm recovery (node membership of each service)...
checking gfs recovery (node membership of each service)...
checking gfs2 recovery (node membership of each service)...
Verifying that clvmd was started properly on the dueler(s)
mounting /dev/mapper/taft-mirror1 on /mnt/taft1 on taft-04
mounting /dev/mapper/taft-mirror2 on /mnt/taft2 on taft-04

Created attachment 314933 [details]
2nd log from taft-01
Created attachment 314934 [details]
2nd log from taft-02
Created attachment 314935 [details]
2nd log from taft-03
Created attachment 314936 [details]
2nd log from taft-04
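For reference, the per-service recovery checks in the log above (fence, dlm, gfs) map to state that can be inspected by hand on a surviving node. A minimal sketch, assuming a RHEL 5 cluster managed by cman; the exact output format is release-dependent:

```sh
# List the fence domain, dlm lockspaces, and gfs mount groups along
# with their current state; a group that never leaves its recovery
# state identifies the layer the hung mount is blocked behind.
cman_tool services

# If a mount still hangs after recovery completes, capture kernel
# stack traces of all tasks (the call traces attached to this bug)
# to syslog.
echo t > /proc/sysrq-trigger
```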
It looks to me like mount is waiting for gfs, and gfs is waiting for dlm. I'm not sure what dlm is waiting for. Dave: Can you take a look at the first call trace for taft-01 and the last call trace for taft-04? They look very similar.

The dlm_recoverd threads are waiting for recovery messages in most of the logs. This means everything that uses the dlm will be blocked waiting for dlm recovery to finish. We should reproduce this after adding <dlm log_debug="1"/> to cluster.conf. Let me know when it gets stuck so I can give it a look before any sysrq.

Setting needinfo flag to collect the log_debug data suggested by Dave.

Just a note that I'm now having trouble reproducing this issue. :(

Taking off the proposed blocker list until this can be reproduced reliably.

I have been able to hit this a few times while testing the latest 5.2.Z. I will attempt to repro with dlm log_debug.

Dave T. looked at Dean's problem and determined that it's not the same thing as this bug. This bug has been in NEEDINFO for six months now, so I'm going to close it. A new bug will likely be opened for Dean's new issue.

Additional info: The problem Dean hit is apparently this one:

Bug 442451 - GFS does not recover the journal when one node does a withdraw of the same filesystem

The fix is in 5.3 but Dean was testing 5.2.z.
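For completeness, a sketch of where the suggested element sits in cluster.conf. Everything other than the <dlm log_debug="1"/> line is a hypothetical placeholder (cluster name, config version, node list), not taken from the taft cluster's actual configuration:

```xml
<?xml version="1.0"?>
<!-- Sketch only: name, config_version, and the node entries are
     placeholders. The <dlm log_debug="1"/> element is the addition
     suggested above; it makes the dlm log its recovery messages to
     syslog, so the state dlm_recoverd is stuck waiting on becomes
     visible without a sysrq dump. -->
<cluster name="example" config_version="2">
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="taft-01" nodeid="1"/>
    <clusternode name="taft-02" nodeid="2"/>
    <clusternode name="taft-03" nodeid="3"/>
    <clusternode name="taft-04" nodeid="4"/>
  </clusternodes>
</cluster>
```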