Bug 1136456

Summary: ricci communication failure during conf update push causes cman version sync issue
Product: Red Hat Enterprise Linux 6
Reporter: John Ruemker <jruemker>
Component: luci
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: high
Version: 6.6
CC: ccaulfie, cfeist, chenders, cluster-maint, djansa, fdinitto, jpokorny, rmccabe, rsteiger
Target Milestone: rc
Keywords: Patch
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: luci-0.26.0-68.el6
Doc Type: Bug Fix
Doc Text:
Cause: When editing the cluster configuration, if an error occurred while attempting to set the new configuration on 1 or more nodes, luci still attempted to activate the new configuration version.
Consequence: The cluster could get out of sync, with some nodes not having the latest configuration.
Fix: Luci no longer activates a new cluster configuration if any errors occurred while writing the new configuration to any of the nodes.
Result: The cluster configuration versions will no longer become out of sync as a result of errors that occurred while attempting to write the new configuration.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-22 07:33:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1075802    
Attachments:
- ricci_helpers.py: Do not activate new configuration if sync failed to any node (flags: none)
- Preliminary enhancement on top of original patch (flags: jpokorny: review? (rmccabe))

Description John Ruemker 2014-09-02 15:40:00 UTC
Description of problem: When making any change in Conga that will produce a new configuration, a failure to communicate with ricci on a subset of nodes (less than the complete set) results in luci pushing the config to just the working nodes and activating that new configuration.  This triggers cman's version mismatch functionality, which suspends activity on nodes missing the latest version:

   Aug 30 15:27:50 corosync [CMAN  ] Activity suspended on this node
   Aug 30 15:27:50 corosync [CMAN  ] Error reloading the configuration, will retry every second
   Aug 30 15:27:51 corosync [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
   Aug 30 15:27:51 corosync [CMAN  ] Can't get updated config version 77: New configuration version has to be newer than current running configuration

This state has other problems as well, such as the potential for the cluster stack to block if a membership transition occurs while the versions are mismatched.  Triggering it is as simple as turning ricci off on some nodes and then making any change in Conga.

luci should not activate the new configuration version in cman if syncing the config to any node fails.
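
For illustration, a minimal sketch of that intended behavior in Python follows; the ricci-facing calls (set_conf, activate_conf) are hypothetical placeholders standing in for whatever ricci_helpers.py actually uses, not luci's real API:

    # Minimal sketch, not luci's actual code: write new_conf to every node
    # and activate it only if every write succeeded.
    def push_cluster_conf(nodes, new_conf, set_conf, activate_conf):
        failures = []
        for node in nodes:
            try:
                set_conf(node, new_conf)
            except Exception as err:  # e.g. ricci unreachable on that node
                failures.append((node, err))

        if failures:
            # Do NOT bump the active config version: cman would otherwise
            # suspend activity on the nodes that never received the new
            # cluster.conf (the log messages shown above).
            return False, failures

        activate_conf(new_conf)
        return True, []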

Version-Release number of selected component (if applicable): luci-0.26.0-56.el6 and earlier


How reproducible: Easily


Steps to Reproduce:
1. Stop ricci on one node, and leave running on others
2. Make a change in conga that requires a cluster.conf update


Actual results: One node reports config version mismatch, activity suspended.


Expected results: Configuration is not updated 


Additional info:

Comment 2 John Ruemker 2014-09-02 15:47:48 UTC
Created attachment 933837 [details]
ricci_helpers.py: Do not activate new configuration if sync failed to any node

This prevents the core issue, but it is only a partial solution.  A few problems remain with this approach:

- Conga still displays success even though the changes weren't applied to the running cluster
- Settings displayed throughout Conga may not reflect what is actually applied
- One or more nodes are left with a different configuration locally

One solution to some of this would be to update IndividualClusterController to actually detect the failure from update_cluster_conf and flash an error, as well as either refusing to redirect to a new page, resetting values to their original state, or taking some other action to account for the failure.  This isn't really straightforward, though: one node now has a different config than the others, so who is luci to say which values are "correct"?
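
A rough sketch of that controller-side handling is below; flash_error, redisplay, and redirect_ok are assumed placeholders for whatever IndividualClusterController really uses to report errors and render pages, not luci APIs:

    # Hedged sketch only: surface the sync failure to the user instead of
    # silently reporting success, and do not redirect to the "success" page.
    def handle_conf_submit(cluster, new_conf, push_conf,
                           flash_error, redisplay, redirect_ok):
        ok, failures = push_conf(cluster, new_conf)
        if not ok:
            flash_error("cluster.conf update failed on: %s"
                        % ", ".join(node for node, _err in failures))
            # Stay on the current page so the displayed values are not
            # mistaken for the configuration that is actually running.
            return redisplay(cluster)
        return redirect_ok(cluster)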

Going one step further, update_cluster_conf could actually test the connection to ricci on all nodes first, and only proceed to sync the config if all of them respond.  This should at least prevent us from pushing out a config when not every node can receive it, but then what if nodes are genuinely down at that time?  And what if the test succeeds but a communication issue then blocks the actual sync?
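
Such a probe could be as simple as attempting a TCP connection to ricci's default port (11111) on each node before syncing; a minimal sketch follows, with the caveat that it only narrows the race window between check and sync rather than closing it:

    # Sketch of a connectivity pre-check; a node can still drop out between
    # this probe and the actual sync, so the activation guard is still needed.
    import socket

    RICCI_PORT = 11111  # default TCP port ricci listens on

    def unreachable_ricci_nodes(hostnames, timeout=5.0):
        """Return the nodes whose ricci port does not accept a TCP connection."""
        unreachable = []
        for host in hostnames:
            try:
                sock = socket.create_connection((host, RICCI_PORT), timeout=timeout)
                sock.close()
            except (socket.error, socket.timeout):
                unreachable.append(host)
        return unreachable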

I couldn't come up with a decent approach to all of this, so the attached is the best I can suggest.  I'm in favor of any solution that prevents us from putting cman into this mismatch state, as it has more than once caused critical cluster outages (including in the case that prompted this bug).

Comment 16 Jan Pokorný [poki] 2015-03-24 20:16:48 UTC
Created attachment 1005983 [details]
Preliminary enhancement on top of original patch

Ryan, John, please take a look at the enhancements that could save us
from the described scenarios (not a 100% guarantee, but perhaps good enough).

Comment 17 Jan Pokorný [poki] 2015-04-13 11:04:01 UTC
As discovered by Radek Steiger during testing, both patches contain inverted
logic when handling the error condition: "if not sync_err:" should be "if sync_err:"
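
In other words, activation should only happen when no sync error was recorded; a tiny sketch of the corrected flow (the surrounding names are placeholders, not the actual patch):

    # Sketch of the corrected check; activate/report are placeholder hooks.
    def finish_conf_update(sync_err, activate, report):
        if sync_err:           # the original patches had "if not sync_err:" here
            report(sync_err)   # surface the failure; leave the running version alone
            return False
        activate()             # reached only when every node took the new conf
        return True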

Comment 26 errata-xmlrpc 2015-07-22 07:33:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1454.html