Description of problem: When making any change in Conga that will produce a new configuration, a failure to communicate with ricci on a subset of nodes (less than the complete set) results in luci pushing the config to just the working nodes and activating that new configuration. This triggers cman's version mismatch functionality, which suspends activity on nodes missing the latest version:
Aug 30 15:27:50 corosync [CMAN ] Activity suspended on this node
Aug 30 15:27:50 corosync [CMAN ] Error reloading the configuration, will retry every second
Aug 30 15:27:51 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 30 15:27:51 corosync [CMAN ] Can't get updated config version 77: New configuration version has to be newer than current running configuration
This state causes further problems, such as the potential for the cluster stack to block if a membership transition occurs while in it. Triggering the issue is as simple as stopping ricci on some nodes and then making any change in Conga.
luci should not activate the new configuration version in cman if syncing the config to any node fails.
Version-Release number of selected component (if applicable): luci-0.26.0-56.el6 and earlier
How reproducible: Easily
Steps to Reproduce:
1. Stop ricci on one node, and leave running on others
2. Make a change in conga that requires a cluster.conf update
Actual results: One node reports config version mismatch, activity suspended.
Expected results: The new configuration is not activated, so no node enters the suspended state.
Created attachment 933837 [details]
ricci_helpers.py: Do not activate new configuration if sync failed to any node
This prevents the core issue, but it is only a partial solution; a few problems remain:
- Conga still displays success even though the changes weren't applied to the running cluster
- Settings displayed throughout Conga may not reflect what is actually applied
- One or more nodes have different configurations locally
One solution to some of this would be to update IndividualClusterController to actually detect the failure from update_cluster_conf and flash an error, and then either refuse to redirect to a new page, reset values to their original state, or take some other action to account for the failure. This isn't really straightforward, though: one node now has a different config than the others, so who is luci to say which values are "correct"?
Going one step further, update_cluster_conf could test the connection to ricci on all nodes first, and only proceed to sync the config if all tests succeed. That should at least prevent pushing out a config when not every node can receive it, but what if nodes are genuinely missing at that time? And what if the test succeeds but a communication failure then blocks the actual sync?
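The pre-flight check floated above could look something like the following sketch. The port and function names here are illustrative only (ricci listens on TCP 11111 by default); this is not luci's actual API, and as noted, the probe narrows the race window but cannot close it, since a node can still drop between the probe and the sync.

```python
import socket

RICCI_PORT = 11111  # ricci's default listening port

def ricci_reachable(host, port=RICCI_PORT, timeout=5):
    """Probe whether ricci's TCP port accepts a connection on host."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def all_nodes_reachable(nodes, port=RICCI_PORT):
    """Only worth syncing if every node answers the probe."""
    return all(ricci_reachable(n, port=port) for n in nodes)
```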
I couldn't come up with a decent approach to all of this, so the attached is the best I can suggest. I'm in favor of any solution that prevents us from putting cman into this mismatch state, as it has more than once caused critical cluster outages (including in the case that prompted this bug).
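For reference, the shape of the attached change is roughly the following. This is a hedged sketch, not the actual ricci_helpers.py code: update_cluster_conf, push_conf, and activate_conf are hypothetical stand-ins (passed in here so the sketch is self-contained), and only the control flow matters: activate nowhere unless the sync succeeded everywhere.

```python
def update_cluster_conf(nodes, new_conf, push_conf, activate_conf):
    """Push new_conf to every node; activate only if all pushes succeed."""
    sync_err = []
    for node in nodes:
        try:
            push_conf(node, new_conf)
        except Exception as e:
            sync_err.append((node, e))
    if sync_err:
        # Leave the running configuration untouched so cman never sees
        # a version mismatch; report the failing nodes to the caller.
        return sync_err
    for node in nodes:
        activate_conf(node)
    return []
```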
Created attachment 1005983 [details]
Preliminary enhancement on top of original patch
Ryan, John, please take a look at the enhancements that could save us
from the described scenarios (not 100% by any means, but perhaps good enough).
As discovered by Radek Steiger during testing, both patches contain inverted
logic in the error-condition handling: "if not sync_err:" should be "if sync_err:".
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.