Description of problem: When making any change in Conga that will produce a new configuration, a failure to communicate with ricci on a subset of nodes (less than the complete set) results in luci pushing the config to just the working nodes and activating that new configuration. This triggers cman's version mismatch functionality, which suspends activity on nodes missing the latest version:
Aug 30 15:27:50 corosync [CMAN ] Activity suspended on this node
Aug 30 15:27:50 corosync [CMAN ] Error reloading the configuration, will retry every second
Aug 30 15:27:51 corosync [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
Aug 30 15:27:51 corosync [CMAN ] Can't get updated config version 77: New configuration version has to be newer than current running configuration
This state causes further problems, such as the potential for the cluster stack to block if a membership transition occurs while in it. Triggering the issue is as simple as stopping ricci on some nodes and then making any change in Conga.
luci should not activate the new configuration version in cman if syncing the config to any node fails.
Version-Release number of selected component (if applicable): luci-0.26.0-56.el6 and earlier
How reproducible: Easily
Steps to Reproduce:
1. Stop ricci on one node, and leave running on others
2. Make a change in conga that requires a cluster.conf update
Actual results: One node reports config version mismatch, activity suspended.
Expected results: The new configuration is not activated, so no node enters the suspended state.
Created attachment 933837 [details]
ricci_helpers.py: Do not activate new configuration if sync failed to any node
This prevents the core issue, but it is only a partial solution; a few problems remain:
- Conga still displays success even though the changes weren't applied to the running cluster
- Settings displayed throughout Conga may not reflect what is actually applied
- One or more nodes have different configurations locally
One solution to some of this would be to update IndividualClusterController to actually detect the failure from update_cluster_conf and flash an error, and then either refuse to redirect to a new page, reset values to their original state, or take some other action to account for the failure. This isn't really straightforward, though: one node now has a different config than the others, so who is luci to say which values are "correct"?
Going one step further, update_cluster_conf could test the connection to ricci on all nodes first, and only proceed to sync the config if all tests succeed. That should at least prevent pushing out a config when not every node can receive it, but what if nodes are genuinely missing at that time? And what if the test succeeds but a communication failure then blocks the actual sync?
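The pre-flight check floated above could look something like the following sketch. The port and function names here are illustrative only (ricci listens on TCP 11111 by default); this is not luci's actual API, and as noted, the probe narrows the race window but cannot close it, since a node can still drop between the probe and the sync.

```python
import socket

RICCI_PORT = 11111  # ricci's default listening port

def ricci_reachable(host, port=RICCI_PORT, timeout=5):
    """Probe whether ricci's TCP port accepts a connection on host."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def all_nodes_reachable(nodes, port=RICCI_PORT):
    """Only worth syncing if every node answers the probe."""
    return all(ricci_reachable(n, port=port) for n in nodes)
```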
I couldn't come up with a decent approach to all of this, so the attached is the best I can suggest. I'm in favor of any solution that prevents us from putting cman into this mismatch state, as it has more than once caused critical cluster outages (including in the case that prompted this bug).
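For reference, the shape of the attached change is roughly the following. This is a hedged sketch, not the actual ricci_helpers.py code: update_cluster_conf, push_conf, and activate_conf are hypothetical stand-ins (passed in here so the sketch is self-contained), and only the control flow matters: activate nowhere unless the sync succeeded everywhere.

```python
def update_cluster_conf(nodes, new_conf, push_conf, activate_conf):
    """Push new_conf to every node; activate only if all pushes succeed."""
    sync_err = []
    for node in nodes:
        try:
            push_conf(node, new_conf)
        except Exception as e:
            sync_err.append((node, e))
    if sync_err:
        # Leave the running configuration untouched so cman never sees
        # a version mismatch; report the failing nodes to the caller.
        return sync_err
    for node in nodes:
        activate_conf(node)
    return []
```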
Created attachment 1005983 [details]
Preliminary enhancement on top of original patch
Ryan, John, please take a look at the enhancements that could save us
from the described scenarios (not 100% by any means, but perhaps good enough).
As discovered by Radek Steiger during testing, both patches contain inverted
logic in the error-condition handling: "if not sync_err:" should be "if sync_err:".
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.