Bug 1546902

Summary: Replication stops working in global region if child region is switched to standby VMDB
Product: Red Hat CloudForms Management Engine
Reporter: Giovanni Fontana <gfontana>
Component: Appliance
Assignee: Gregg Tanzillo <gtanzill>
Status: CLOSED DUPLICATE
QA Contact: Alex Newman <anewman>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 5.8.0
CC: abellott, anewman, lcouzens, ncarboni, obarenbo
Target Milestone: GA
Target Release: cfme-future
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-20 14:19:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Giovanni Fontana 2018-02-20 00:11:25 UTC
Created attachment 1398051
Screenshot evidence

Description of problem:
In a multi-region HA environment, when the primary VMDB of a child region becomes unavailable and repmgr and failover-monitor switch the workers to the standby VMDB, replication in the global region stops working and a "500 Internal Server Error" is shown in the Replication tab (see the attached screenshots).

Version-Release number of selected component (if applicable): 5.8.0


How reproducible:
Yes

Steps to Reproduce:
1. Set up a global region and a remote region.
2. Configure the remote region DB for HA.
3. Simulate a failure of the primary DB in the remote region. The standby VMDB is switched to primary.
4. Go to "Configuration -> Settings -> Region -> Replication tab". The "500 Internal Server Error" is displayed.

Actual results:
- Replication stops and a "500 Internal Server Error" is displayed.

Expected results:
- The global region should detect that the primary VMDB is down and start working with the standby VMDB, just as failover-monitor does for the workers in the region.
- No "Internal Server Error" should be displayed.

Additional info:
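To illustrate the expected behaviour above, here is a minimal sketch of the kind of host selection the global region could perform. It is not taken from CloudForms or pg-pglogical. The function names (tcp_reachable, pick_subscription_host), the host names and the default port 5432 are assumptions made for illustration only.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_reachable(host: str, port: int = 5432, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_subscription_host(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Optional[str]:
    """Return the first candidate host accepted by probe, or None.

    Hypothetical helper: candidates lists the remote region's database
    hosts in order of preference (old primary first, then standbys).
    """
    for host in candidates:
        if probe(host):
            return host
    return None


# Example with a fake probe: the old primary is down, the standby answers.
up = {"standby.remote.example"}
chosen = pick_subscription_host(
    ["primary.remote.example", "standby.remote.example"],
    probe=lambda h: h in up,
)
print(chosen)
```

A TCP check alone only shows that a host accepts connections, not that it has been promoted. A real check would also need to confirm that the node is the current primary, for example by querying PostgreSQL's pg_is_in_recovery() on it.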

Comment 2 luke couzens 2018-02-20 09:48:30 UTC
Is this not a duplicate of 1391095? 

With the way replication/HA currently works, it won't fail over correctly without some use of a virtual IP, as stated in that RFE bug.

Comment 3 Giovanni Fontana 2018-02-20 13:15:30 UTC
I think so, except for the "500 Internal Server Error" issue (I didn't see any reference to this error there).

Comment 4 Nick Carboni 2018-02-20 14:19:52 UTC
The 500 error was fixed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1540688 (specifically in https://github.com/ManageIQ/pg-pglogical/pull/20).

Marking this as a duplicate of bug 1391095.

*** This bug has been marked as a duplicate of bug 1391095 ***

Comment 5 Giovanni Fontana 2018-02-20 15:24:47 UTC
Hi Nick! The screenshot I have is a little bit different. Is it also fixed by this PR?

Regards,

Giovanni