Bug 1546902

Summary:	Replication stops working in global region if child region is switched to standby VMDB
Product:	Red Hat CloudForms Management Engine	Reporter:	Giovanni Fontana <gfontana>
Component:	Appliance	Assignee:	Gregg Tanzillo <gtanzill>
Status:	CLOSED DUPLICATE	QA Contact:	Alex Newman <anewman>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	5.8.0	CC:	abellott, anewman, lcouzens, ncarboni, obarenbo
Target Milestone:	GA
Target Release:	cfme-future
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-02-20 14:19:52 UTC	Type:	Bug

Description Giovanni Fontana 2018-02-20 00:11:25 UTC

Created attachment 1398051
Screenshot evidence

Description of problem:
In a multi-region HA environment, when the primary VMDB of a child region becomes unavailable and repmgr and the failover monitor switch the workers to the standby VMDB, replication in the global region stops working and a "500 Internal Server Error" is shown in the Replication tab (see the attached screenshots).

Version-Release number of selected component (if applicable): 5.8.0


How reproducible:
Yes

Steps to Reproduce:
1. Set up a global region and a remote region.
2. The remote region DB must be configured for HA.
3. Simulate a failure of the primary DB in the remote region. The standby VMDB becomes the primary VMDB.
4. Access "Configuration -> Settings -> Region -> Replication tab". A "500 Internal Server Error" is presented.

Actual results:
- Replication stops and a "500 Internal Server Error" is presented.

Expected results:
- The global region should detect that the primary VMDB is down and start working with the standby VMDB, just as the failover monitor does for the workers in the region.
- No "Internal Server Error" should be presented.
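To illustrate the expected behaviour, here is a minimal sketch of client-side endpoint selection: try the configured primary first and fall back to the standby when it does not respond. This is not how CloudForms or pglogical implements replication; the host names, the port, and the functions `tcp_reachable` and `pick_endpoint` are hypothetical and exist only for this illustration. A TCP probe only shows that a host accepts connections, not that it is the current primary, so a real check would also have to confirm the database role.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoint(
    candidates: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first candidate for which probe() succeeds, else None.

    Candidates are tried in order, so list the preferred (primary)
    database first and the standby after it.
    """
    for endpoint in candidates:
        if probe(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical hosts for the remote region's primary and standby VMDB.
    hosts = [("vmdb-primary.example.com", 5432),
             ("vmdb-standby.example.com", 5432)]
    print(pick_endpoint(hosts))
```

The probe is passed in as a parameter so the selection logic can be exercised without a network, and so a stricter check (for example, one that queries the database role) could be substituted.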

Additional info:

Comment 2 luke couzens 2018-02-20 09:48:30 UTC

Is this not a duplicate of 1391095? 

With the way replication/HA currently works, it won't fail over correctly without some use of a virtual IP, as stated in that RFE bug.

Comment 3 Giovanni Fontana 2018-02-20 13:15:30 UTC

I think so, except for the "500 Internal Server Error" issue (I didn't see any reference to this error there).

Comment 4 Nick Carboni 2018-02-20 14:19:52 UTC

The 500 error was fixed as a part of https://bugzilla.redhat.com/show_bug.cgi?id=1540688 (specifically in https://github.com/ManageIQ/pg-pglogical/pull/20)

Marking this as a duplicate of bug 1391095.

*** This bug has been marked as a duplicate of bug 1391095 ***

Comment 5 Giovanni Fontana 2018-02-20 15:24:47 UTC

Hi Nick! The screenshot I have is a little bit different. Is that also fixed by this PR?

Regards,

Giovanni