
Bug 1753246

Summary: OSP 14->15: Galera can refuse to sync if it has more recent data than the rest of the cluster
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Hardware: All
OS: Linux
Reporter: Jiri Stransky <jstransk>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Sasha Smolyak <ssmolyak>
CC: lmiccini, mburns
Keywords: Triaged, ZStream
Type: Bug
Last Closed: 2020-01-14 10:48:46 UTC
Bug Blocks: 1727807

Description Jiri Stransky 2019-09-18 13:22:18 UTC
This bug tracks an issue mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1743326#c6. Most likely we need some way to ensure that we transfer data from the node that has the latest version, or to force a particular node to shut down last. I'll quote Damien's description:

> 2. Joining nodes refuse to synchronize against the cluster because they
> think they have more recent data.
> 
> 
> Later on, controller-1 is added in the cluster:
> 
> srp 23 09:37:07 controller-0 pacemaker-controld[57750]: notice: Node
> controller-1 state is now member
> 
> And the galera resource is stopped entirely, for a reason that is not
> relevant here.
> 
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: warning:
> Processing failed monitor of galera:0 on galera-bundle-0: unknown error
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera-bundle-podman-0               (                    controller-0 )  
> due to node availability
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera-bundle-0                      (                    controller-0 )  
> due to node availability
> srp 23 09:41:23 controller-0 pacemaker-schedulerd[57749]: notice:  * Stop   
> galera:0                             (          Master galera-bundle-0 )  
> due to node availability
> 
> (Note: the "Processing failed monitor" warning comes from the previous
> error, so it's harmless)
> 
> Eventually, pacemaker bootstraps a new cluster, and chooses controller-0 as
> the bootstrap node:
> 
> srp 23 09:41:53 controller-0 galera(galera)[394705]: INFO: Galera started
> srp 23 09:41:53 controller-0 pacemaker-controld[57750]: notice: Result of
> promote operation for galera on galera-bundle-0: 0 (ok)
> 
> Pacemaker then tells controller-1 to rejoin the galera cluster. But when
> galera is started on controller-1, it refuses to rejoin and stops in error,
> because it thinks it has more recent data than controller-0.
> 
> 2019-08-23  9:42:54 0 [ERROR] WSREP:
> gcs/src/gcs_group.cpp:group_post_state_exchange():321: Reversing history:
> 631252 -> 543483, this member has applied 87769 more events than the primary
> component.Data loss is possible. Aborting.
> 
> In reaction, controller-0 refuses to stay in the primary galera component,
> so pacemaker has to restart it as well.
> 
> 2019-08-23  9:42:59 0 [Note] WSREP: evs::proto(5ab6d537, OPERATIONAL,
> view_id(REG,5ab6d537,2)) suspected node without join message, declaring
> inactive
> 2019-08-23  9:43:00 0 [Note] WSREP: view(view_id(NON_PRIM,5ab6d537,2) memb {
> 
> At this point, galera is unmanaged on controller-1, so pacemaker can no
> longer bootstrap the cluster, and the DB resource is unavailable to
> OpenStack.
> 
> This is the first time I have seen a node refuse to rejoin an existing
> cluster. Could this indicate that the DB save/restore step in the upgrade
> wasn't run on the node with the most recent data?
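
For triage, the numbers in the "Reversing history" error quoted above can be pulled out and cross-checked. The sketch below is only an illustration: the regular expression and the function name are hypothetical helpers written for this one message, and reading the first number as the joining node's sequence number and the second as the primary component's is my interpretation of the message text.

```python
import re

# Error message copied from the quoted log above, rejoined onto one line.
LOG_LINE = (
    "2019-08-23  9:42:54 0 [ERROR] WSREP: "
    "gcs/src/gcs_group.cpp:group_post_state_exchange():321: Reversing history: "
    "631252 -> 543483, this member has applied 87769 more events than the primary "
    "component.Data loss is possible. Aborting."
)

# Hypothetical pattern for this specific message; not part of Galera itself.
REVERSING_HISTORY = re.compile(
    r"Reversing history: (?P<local>\d+) -> (?P<primary>\d+), "
    r"this member has applied (?P<delta>\d+) more events"
)


def parse_reversing_history(line):
    """Return (local_seqno, primary_seqno, reported_delta), or None if no match."""
    match = REVERSING_HISTORY.search(line)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("local", "primary", "delta"))


result = parse_reversing_history(LOG_LINE)
if result is not None:
    local_seqno, primary_seqno, reported_delta = result
    # The reported delta should equal the gap between the two sequence numbers.
    print(local_seqno - primary_seqno == reported_delta)
```

With the values from this report, 631252 - 543483 = 87769, which matches the count in the message: controller-1 was that many events ahead of the cluster bootstrapped from controller-0.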

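One way to approach "transfer data from the node that has the latest version" is to compare the seqno each node recorded in its grastate.dat before deciding where to bootstrap. The following is a minimal sketch under stated assumptions, not what the galera resource agent or the upgrade tooling actually does: the function names are hypothetical, the uuid is a made-up placeholder, and the seqno values are simply the two numbers from the error message above.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat-style file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states):
    """Return the node name with the highest recorded seqno.

    `states` maps a node name to the text of its grastate.dat. A seqno of -1
    (or a missing seqno) means the position is unknown, so the choice would
    not be safe and the function raises instead of guessing.
    """
    seqnos = {}
    for node, text in states.items():
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno < 0:
            raise ValueError(f"{node}: seqno unknown, recover the position first")
        seqnos[node] = seqno
    return max(seqnos, key=seqnos.get)


# Illustrative input only: placeholder uuid, seqnos taken from the error above.
SAMPLE_STATES = {
    "controller-0": (
        "# GALERA saved state\n"
        "uuid:    00000000-0000-0000-0000-000000000000\n"
        "seqno:   543483\n"
    ),
    "controller-1": (
        "# GALERA saved state\n"
        "uuid:    00000000-0000-0000-0000-000000000000\n"
        "seqno:   631252\n"
    ),
}

print(pick_bootstrap_node(SAMPLE_STATES))
```

With these sample values the sketch selects controller-1, the node that the log shows was ahead, rather than controller-0, which is the node pacemaker bootstrapped from in this report.
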
Comment 5 Luca Miccini 2020-01-14 10:48:46 UTC
we are unable to reproduce this issue. closing for now.