Bug 546082
| Summary: | groupd stuck by partition merge | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | David Teigland <teigland> |
| Component: | cman | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.4 | CC: | ccaulfie, cluster-maint, djansa, edamato, jkortus |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | cman-2.0.115-24.el5.src.rpm | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-03-30 08:42:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
Description
David Teigland
2009-12-09 22:45:50 UTC
Just to set expectations, we're not going to have solid, comprehensive handling of partition merging in RHEL5. What we can do is make a "best effort" attempt at detecting some simple partition merging scenarios and making them recoverable without a full cluster restart.

Incidentally, one of the goals of the "disallowed" cman feature was to detect and mask partition merges in cman to make them invisible to the higher levels, so we could ignore them altogether. Cman is doing its job in that regard. The problem in this bug is that the higher levels are still exposed to merges through the cpg layer, and that's causing groupd confusion.

I'm not sure whether some of the changes to groupd through the releases have changed things so that the cpg merge events have started to cause problems when they didn't before. There isn't anything obvious to me looking at the commits.

In the example above, nodes 2 and 3 will show group_tool -v like the following after the merge and killing of node 1:

    fence 0 default 00010002 FAIL_ALL_STOPPED 1 100020003 -1 [1 2 3]

and group_tool dump will show:

    1260402364 0:default process_current_event 100020003 1 FAIL_ALL_STOPPED
    1260402364 no cman update for recovery_set 1 quorate 1

and if cman kills the merged node before it's fenced, then something like this will appear in /var/log/messages:

    openais[3120]: [MAIN ] Killing node node-01 because it has rejoined the cluster with existing state

Created attachment 377342 [details]
patch to work around

This patch seems to resolve the problem in the simple partition merge test case I'm using.
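For anyone trying to confirm that they are hitting the same symptoms, the checks described above can be sketched as a small Python script. This is only an illustration, not part of the patch: the function names (`parse_group_line`, `looks_stuck`, `killed_nodes`) are hypothetical, and the field layout is assumed from the single sample line in this report (type, level, name, id, state, some further event fields, then the member list in brackets). A group that merely passes through FAIL_ALL_STOPPED during normal recovery is not stuck, so the check is only meaningful if the state persists across repeated samples.

```python
import re

# Assumed layout, inferred from the sample line in this report:
#   type level name id state <further event fields> [member node ids]
GROUP_LINE = re.compile(
    r"^(?P<type>\S+)\s+(?P<level>\d+)\s+(?P<name>\S+)\s+(?P<id>[0-9a-fA-F]+)\s+"
    r"(?P<state>\S+)(?P<rest>[^\[]*)(?:\[(?P<members>[\d\s]*)\])?\s*$"
)

# Message text copied from the /var/log/messages example in this report.
KILL_MSG = re.compile(
    r"Killing node (?P<node>\S+) because it has rejoined the cluster "
    r"with existing state"
)


def parse_group_line(line):
    """Parse one `group_tool -v` group line; return None if it does not match."""
    m = GROUP_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "level": int(m.group("level")),
        "name": m.group("name"),
        "id": m.group("id"),
        "state": m.group("state"),
        # Remaining event fields are kept verbatim; their meaning is not assumed.
        "extra": m.group("rest").split(),
        "members": [int(x) for x in (m.group("members") or "").split()],
    }


def looks_stuck(line, stuck_state="FAIL_ALL_STOPPED"):
    """True if the group line reports the state seen in this bug."""
    info = parse_group_line(line)
    return info is not None and info["state"] == stuck_state


def killed_nodes(log_text):
    """Return node names that cman killed for rejoining with existing state."""
    return [m.group("node") for m in KILL_MSG.finditer(log_text)]


if __name__ == "__main__":
    sample = "fence 0 default 00010002 FAIL_ALL_STOPPED 1 100020003 -1 [1 2 3]"
    log = ("openais[3120]: [MAIN ] Killing node node-01 because it has "
           "rejoined the cluster with existing state")
    print(parse_group_line(sample))
    print(looks_stuck(sample))
    print(killed_nodes(log))
```

In practice the input lines would come from running `group_tool -v` on the surviving nodes and from reading /var/log/messages; the script deliberately does not invoke either, so it can be tried against saved output.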
Pushed to RHEL55 branch:
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=23d5cbe5dfcf20040814a09aafa33faf9f6f66e9

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html