Bug 1753246

Summary:	OSP 14->15: Galera can refuse to sync if it has more recent data than the rest of the cluster
Product:	Red Hat OpenStack	Reporter:	Jiri Stransky <jstransk>
Component:	openstack-tripleo-heat-templates	Assignee:	RHOS Maint <rhos-maint>
Status:	CLOSED WORKSFORME	QA Contact:	Sasha Smolyak <ssmolyak>
Severity:	high	Docs Contact:
Priority:	high
Version:	15.0 (Stein)	CC:	lmiccini, mburns
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-14 10:48:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1727807

Description Jiri Stransky 2019-09-18 13:22:18 UTC

This is to track an issue mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1743326#c6 . Most likely we need some way to ensure that data is transferred from the node that has the latest version, or a way to force a particular node to shut down last. I'll quote Damien's description:

> 2. Joining nodes refuse to synchronize against the cluster because they
> think they have more recent data.
> 
> 
> Later on, controller-1 is added in the cluster:
> 
> srp 23 09:37:07 controller-0 pacemaker-controld[57750]: notice: Node
> controller-1 state is now member
> 
> And the galera resource is stopped entirely, for a reason that is not
> relevant here.
> 
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: warning:
> Processing failed monitor of galera:0 on galera-bundle-0: unknown error
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera-bundle-podman-0               (                    controller-0 )  
> due to node availability
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera-bundle-0                      (                    controller-0 )  
> due to node availability
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera:0                             (          Master galera-bundle-0 )  
> due to node availability
> 
> (Note: the "Processing failed monitor" warning comes from the previous
> error, so it's harmless)
> 
> Eventually, pacemaker bootstraps a new cluster, and chooses controller-0 as
> the bootstrap node:
> 
> srp 23 09:41:53 controller-0 galera(galera)[394705]: INFO: Galera started
> srp 23 09:41:53 controller-0 pacemaker-controld[57750]: notice: Result of
> promote operation for galera on galera-bundle-0: 0 (ok)
> 
> It then tells controller-1 to rejoin the galera cluster. But when galera is
> started on controller-1, it refuses to rejoin and stops in error, because it
> thinks it has more recent data than controller-0.
> 
> 2019-08-23  9:42:54 0 [ERROR] WSREP:
> gcs/src/gcs_group.cpp:group_post_state_exchange():321: Reversing history:
> 631252 -> 543483, this member has applied 87769 more events than the primary
> component.Data loss is possible. Aborting.
> 
> controller-0 refuses to stay in primary galera component as a reaction, so
> pacemaker has to restart it as well.
> 
> 2019-08-23  9:42:59 0 [Note] WSREP: evs::proto(5ab6d537, OPERATIONAL,
> view_id(REG,5ab6d537,2)) suspected node without join message, declaring
> inactive
> 2019-08-23  9:43:00 0 [Note] WSREP: view(view_id(NON_PRIM,5ab6d537,2) memb {
> 
> At this point, galera is unmanaged on controller-1, so pacemaker can no
> longer bootstrap the cluster, and the DB resource is unavailable to
> OpenStack.
> 
> This is the first time I've seen a node refuse to rejoin an existing
> cluster. Could this indicate that the DB save/restore step in the upgrade
> wasn't run on the node with the most recent data?

Comment 5 Luca Miccini 2020-01-14 10:48:46 UTC

We are unable to reproduce this issue. Closing for now.