Bug 1136456
Summary: ricci communication failure during conf update push causes cman version sync issue
Product: Red Hat Enterprise Linux 6
Reporter: John Ruemker <jruemker>
Component: luci
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: high
Version: 6.6
CC: ccaulfie, cfeist, chenders, cluster-maint, djansa, fdinitto, jpokorny, rmccabe, rsteiger
Target Milestone: rc
Keywords: Patch
Hardware: All
OS: All
Fixed In Version: luci-0.26.0-68.el6
Doc Type: Bug Fix
Doc Text:
Cause: When editing the cluster configuration, if an error occurred while attempting to set the new configuration on one or more nodes, luci still attempted to activate the new configuration version.
Consequence: The cluster could get out of sync, with some nodes not having the latest configuration.
Fix: luci no longer activates a new cluster configuration if any errors occurred while writing the new configuration to any of the nodes.
Result: Cluster configuration versions no longer become out of sync as a result of errors that occur while writing the new configuration.
Last Closed: 2015-07-22 07:33:10 UTC
Type: Bug
Bug Blocks: 1075802
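The Cause/Fix pair in the Doc Text can be sketched with a small in-memory model. This is a hypothetical illustration only: the `Node` class, `push_cluster_conf`, and the helper names are stand-ins for luci's real ricci XML-RPC calls, not its actual API.

```python
# Hypothetical model of the fix: write the new cluster.conf to every node,
# but activate the new version only if every write succeeded. Node and its
# fields are stand-ins for ricci-managed cluster members.

class Node:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.conf = "old"
        self.active_version = 1     # version cman currently runs

    def write_conf(self, conf):
        if not self.reachable:
            raise IOError("ricci unreachable on %s" % self.name)
        self.conf = conf

def push_cluster_conf(nodes, new_conf, new_version):
    """Write new_conf everywhere; activate only on a clean sweep."""
    errors = []
    for node in nodes:
        try:
            node.write_conf(new_conf)
        except IOError as err:
            errors.append((node.name, str(err)))
    if errors:
        # The fix: any write failure means we do NOT activate, so cman
        # stays on the old, consistent version cluster-wide.
        return errors
    for node in nodes:
        node.active_version = new_version
    return []

nodes = [Node("node1"), Node("node2", reachable=False), Node("node3")]
errs = push_cluster_conf(nodes, "new", 2)
# errs names node2; every node still runs version 1
```

Before the fix, the activation loop ran even when `errors` was non-empty, which is exactly the mismatch state the bug describes.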
Description (John Ruemker, 2014-09-02 15:40:00 UTC)
Created attachment 933837 [details]
ricci_helpers.py: Do not activate new configuration if sync failed to any node
The attached patch prevents the core issue, but it is only a partial solution. Even with it, a few issues remain:
- Conga still displays success even though the changes weren't applied to the running cluster
- Setting values displayed throughout conga may not reflect what's applied
- One or more nodes have different configurations locally
One solution to part of this would be to update IndividualClusterController to actually detect the failure from update_cluster_conf and flash an error, and then either refuse to redirect to a new page, reset values to their original state, or take some other action to account for the failure. This isn't really straightforward, though: one node now has a different config than the others, so who is luci to say which values are "correct"?
Going one step further, update_cluster_conf could test the connection to ricci on all nodes first, and only proceed to sync the config if all tests succeed. That should at least prevent us from pushing out a config when some nodes cannot receive it, but then what if nodes are genuinely missing at that time? And what if the test succeeds but a communication issue then blocks the actual sync?
I couldn't come up with a decent approach to all of this, so the attached is the best I can suggest. I'm in favor of any solution that prevents us from putting cman into this mismatch state, as it has more than once caused critical cluster outages (including in the case that prompted this bug).
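The probe-then-sync idea from the comment above could look roughly like this. It is a sketch under stated assumptions: `probe` and `sync` are caller-supplied stand-ins, and luci's real update_cluster_conf has a different signature.

```python
# Rough sketch of "test ricci connectivity first, sync only if all pass".
# probe(node) -> bool and sync(node, conf) (which may raise IOError) are
# injected stand-ins, not luci's actual helpers.

def update_cluster_conf(nodes, new_conf, probe, sync):
    """Return (synced, failed) lists of node names."""
    unreachable = [n for n in nodes if not probe(n)]
    if unreachable:
        # Refuse to push anything: a stale but consistent cluster is
        # safer than a version mismatch.
        return [], unreachable

    synced, failed = [], []
    for n in nodes:
        try:
            sync(n, new_conf)   # the probe can pass and this can still
            synced.append(n)    # fail: the check only narrows the window
        except IOError:
            failed.append(n)
    return synced, failed

ok, bad = update_cluster_conf(
    ["n1", "n2"], "conf",
    probe=lambda n: True,
    sync=lambda n, c: None,
)
```

Note the comment in the sync loop: this is exactly the "test succeeds but a comms issue blocks the actual sync" race the reporter raises, which is why the probe can only reduce, not eliminate, the risk.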
Created attachment 1005983 [details]
Preliminary enhancement on top of original patch
Ryan, John, please take a look at the enhancements that could save us from the described scenarios (not 100% coverage, but perhaps good enough).
As discovered by Radek Steiger during testing, both patches contain wrong logic wrt. handling the error condition: "if not sync_err:" -> "if sync_err:"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1454.html
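The inverted check found in testing can be illustrated with a minimal stand-in; `sync_err` and `activate_if_clean` here are hypothetical names representing whatever error value the real patch collects, not luci's actual code.

```python
# Minimal illustration of the corrected condition: activate the new
# configuration only when no sync errors were collected.

def activate_if_clean(sync_err, activate):
    if sync_err:        # the patches mistakenly had "if not sync_err:" here
        return False    # errors occurred: leave the old version active
    activate()          # clean sync: safe to bump the config version
    return True

activated = []
first = activate_if_clean(["node2: ricci unreachable"], lambda: activated.append("v2"))
second = activate_if_clean([], lambda: activated.append("v2"))
# first is False and triggers no activation; second is True
```

With the condition inverted, the guard would have activated precisely on failure and skipped activation on success, reproducing the original version-mismatch bug.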