Bug 496130
| Field | Value |
|---|---|
| Summary | node mysteriously evicted by qdisk |
| Product | Red Hat Enterprise Linux 5 |
| Component | cman |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED WORKSFORME |
| Severity | medium |
| Priority | high |
| Target Milestone | rc |
| Target Release | --- |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, edamato, nstraz, samuel.kielek, smayhew, tao |
| Doc Type | Bug Fix |
| Last Closed | 2009-10-27 20:41:50 UTC |
| Bug Depends On | 500450 |
| Bug Blocks | 499522 |
Description (Corey Marthaler, 2009-04-16 19:24:52 UTC)
Created attachment 339905 [details]
log from taft-01
Created attachment 339906 [details]
log from taft-02
Created attachment 339907 [details]
log from taft-03
Created attachment 339908 [details]
log from taft-04
FYI - I hit an issue today that appears very similar to this one while running the revolver test. It may have been a timing issue where one of the three killed nodes came back up faster than the one remaining node had sorted through the others leaving the cluster. I'll post the logs for this issue.

```
Scenario iteration 0.2 started at Fri Apr 17 10:56:49 CDT 2009
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one
Those picked to face the revolver... taft-03-bond taft-02-bond taft-01-bond
Feeling lucky taft-03-bond? Well do ya? Go'head make my day...
Feeling lucky taft-02-bond? Well do ya? Go'head make my day...
Feeling lucky taft-01-bond? Well do ya? Go'head make my day...
Verifying nodes were removed from cluster
Verified taft-01-bond was removed on taft-04-bond
Verified taft-02-bond was removed on taft-04-bond
Verified taft-03-bond was removed on taft-04-bond
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable
Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
checking fence recovery (state of each service)...
checking dlm recovery (state of each service)...
checking gfs recovery (state of each service)...
checking gfs2 recovery (state of each service)...
checking fence recovery (node membership of each service)...
checking dlm recovery (node membership of each service)...
checking gfs recovery (node membership of each service)...
checking gfs2 recovery (node membership of each service)...
Verifying that clvmd was started properly on the dueler(s)
clvmd is not running on taft-01-bond
```

Created attachment 340037 [details]
new log from taft-01
Created attachment 340038 [details]
new log from taft-02
Created attachment 340040 [details]
new log from taft-03
Created attachment 340041 [details]
new log from taft-04
Nate and I were chasing this on RHEL4 too -- using the deadline scheduler helped, but did not entirely resolve the issue. As a start, switch the I/O scheduler to the deadline scheduler; then we can tune to make the cluster more flexible. I also have another fix for the 'undead' messages if you would like. It's important to note that Nate's cluster also has an MSA1000.

Ok, another person using a very different sort of array (EMC² Symmetrix) reported a similar problem on RHEL 5.3. At the time of eviction, I/O to the same array (though a different LUN) had very strange iostat numbers. For example:

avgqu-sz - 30780484020872.61 (that's not a typo)
await - 5.03
svctm - 74906.50

I think we need to cross-reference this data on one or both of the MSAs in use and see if we can reproduce this.

Created attachment 342375 [details]
iostat numbers
See:
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c9
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c10
https://bugzilla.redhat.com/show_bug.cgi?id=490147#c11

*** Bug 514627 has been marked as a duplicate of this bug. ***
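The comment above recommends moving to the deadline I/O scheduler as a first step. On RHEL 5 this can be changed per-device at runtime through sysfs, or made the default at boot; a minimal sketch follows (the device name `sdb` is a placeholder for the actual qdisk/shared-storage device, which must be substituted):

```shell
# Show the schedulers available for the device; the active one is
# shown in brackets, e.g. "noop anticipatory deadline [cfq]".
# (sdb is a placeholder for the qdisk / shared storage device.)
cat /sys/block/sdb/queue/scheduler

# Switch that device to the deadline elevator at runtime
# (requires root; takes effect immediately):
echo deadline > /sys/block/sdb/queue/scheduler

# Or make deadline the default for all block devices at boot by
# adding "elevator=deadline" to the kernel line in /boot/grub/grub.conf.
```

After switching, the qdisk device's latencies can be re-checked with `iostat -x` to see whether the await/svctm figures quoted above settle down.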