Bug 220211
| Field | Value |
|---|---|
| Summary: | multiple qdisk master after network outage |
| Product: | [Retired] Red Hat Cluster Suite |
| Reporter: | Frederik Ferner <frederik.ferner> |
| Component: | cman |
| Assignee: | Lon Hohberger <lhh> |
| Status: | CLOSED ERRATA |
| QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium |
| Docs Contact: | |
| Priority: | medium |
| Version: | 4 |
| CC: | cfeist, cluster-maint, jos, tao |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | All |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | RHBA-2007-0134 |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2007-05-10 21:04:50 UTC |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
Description
Frederik Ferner
2006-12-19 18:28:44 UTC
Created attachment 144025 [details]
Email describing the problem in more detail
Hmm, how close in time are the two systems; are they using NTP?

Yes, they are using NTP.

It looks like qdiskd doesn't wait long enough for votes prior to declaring itself a master; it's supposed to wait at least one full cycle after thinking it's online prior to even thinking about trying to make a bid for master status.

Fixes in CVS; RHEL4 / RHEL5 / head branches. Let me know if you'd like test packages.

Yes, test packages would be great.

http://people.redhat.com/lhh/packages.html You'll need the cman-kernel update (built against 42.0.3 kernel I think).

NOT fixed. I hit it this morning.

Part 1: Ok, here's the scoop - with the current RHEL4 CVS bits, the master_wait time (time to wait for NACKs / ACKs) was less than the amount of time used to declare a node up. This means that multiple nodes could make bids for master - and because they would ignore each other's disk-writes until the node was declared up, multiple nodes could declare themselves master. This is obviously wrong. By changing the formula for calculating the tko_up and master_wait variables to tko/3 and tko/2 respectively (they were the opposite), we eliminate the problem. Furthermore, we must restrict the master_wait to be a value greater than the tko_up value.

Part 2: We require a resolution algorithm in the event that this happens beyond our control - so that we have predictable behavior in multi-master situations. The idea here is that we need all masters to rescind master status gracefully - forcing a re-election. This isn't hard, and needs to be done.

Created attachment 150525 [details]
Fix
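The timing change from Part 1 and the rescind rule from Part 2 can be illustrated with a rough sketch. This is not the code from the attached fix: the function names (`old_timeouts`, `derive_timeouts`, `rescind_on_conflict`) are hypothetical, integer division is an assumption about rounding, and bumping `master_wait` to `tko_up + 1` is just one possible way to enforce the "master_wait greater than tko_up" restriction.

```python
from typing import Iterable, Set, Tuple


def old_timeouts(tko: int) -> Tuple[int, int]:
    """Pre-fix ordering as described in the comment: the two formulas
    were swapped, so master_wait could be shorter than tko_up.
    Returns (tko_up, master_wait) in cycles. Rounding is an assumption."""
    return tko // 2, tko // 3


def derive_timeouts(tko: int) -> Tuple[int, int]:
    """Post-fix ordering: tko_up = tko/3, master_wait = tko/2,
    with master_wait forced to be strictly greater than tko_up.
    Returns (tko_up, master_wait) in cycles."""
    if tko < 1:
        raise ValueError("tko must be a positive number of cycles")
    tko_up = tko // 3
    master_wait = tko // 2
    # Assumed enforcement of the restriction: for small tko values,
    # integer division can make the two equal, so bump master_wait.
    if master_wait <= tko_up:
        master_wait = tko_up + 1
    return tko_up, master_wait


def rescind_on_conflict(claimed_masters: Iterable[str]) -> Set[str]:
    """Part 2 idea: if more than one node claims master, every claimant
    rescinds (empty result), which forces a re-election.
    Returns the set of nodes that keep master status."""
    claimed = set(claimed_masters)
    if len(claimed) > 1:
        return set()
    return claimed


if __name__ == "__main__":
    for tko in (1, 3, 10):
        print(tko, "old:", old_timeouts(tko), "new:", derive_timeouts(tko))
    print(rescind_on_conflict(["node1", "node2"]))
```

With the old ordering, a node's bid window (`master_wait`) could close before its peers were even considered up, which matches the multi-master symptom reported here; with the new ordering the bid window always outlasts the up-detection window.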
Adding rhel-4.5 tag so we can build.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0134.html