Bug 1391095 - [RFE][L-8] Replication does not support HA
Summary: [RFE][L-8] Replication does not support HA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Replication
Version: 5.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: MVP
Target Release: 5.10.0
Assignee: Nick Carboni
QA Contact: Tasos Papaioannou
URL:
Whiteboard: replication:ha
Duplicates: 1546902
Depends On:
Blocks: 1480288 1511957 1555371
 
Reported: 2016-11-02 14:34 UTC by Alex Newman
Modified: 2022-03-13 14:08 UTC
CC List: 22 users

Fixed In Version: 5.10.0.15
Doc Type: Known Issue
Doc Text:
In highly available CloudForms environments, data synchronization to the global region ceases to function after a remote region failover event. This occurs because of an issue with both primary to standby database (HA) replication configured along with region-to-region (remote/global) replication. To work around this, remove and re-create the subscription in the global region web user interface to point to the new primary database server in the remote region. After applying the workaround, replication to the global region will be restored.
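For reference, the same workaround expressed as a rough Rails-console sketch on the global region appliance (the web UI steps above are the supported path; this assumes the PglogicalSubscription model's ActiveRecord-like find/delete/save! interface, and the host and credentials shown are illustrative):

  sub = PglogicalSubscription.find(:all).first   # the stale subscription
  sub.delete                                     # drop it

  PglogicalSubscription.new(
    :host     => "new-primary.example.com",      # new primary in the remote region
    :port     => 5432,
    :dbname   => "vmdb_production",
    :user     => "root",
    :password => "********"
  ).save!                                        # re-create replication against the new primary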
Clone Of:
Environment:
Last Closed: 2019-02-07 23:02:18 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHSA-2019:0212 (last updated 2019-02-07 23:02:27 UTC)

Description Alex Newman 2016-11-02 14:34:32 UTC
Description of problem:
When replication from an HA region is configured using the original primary DB, replication stops working after a failover event because the subscription still points to the old DB, which may be offline or no longer the active primary.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Setup region w/ HA
2. Setup replication subscription to above HA region
3. Cause failover in the above HA region
4. After failover is complete, create new zone in the above HA region
5. Check for new HA zone in global region

Actual results:
Replication still points to the failed DB, so new data is not replicated.

Expected results:
Replication should be HA-aware and automatically update itself after a failover.

Additional info:

Comment 2 Nick Carboni 2016-12-02 20:33:50 UTC
Investigating solutions to this using Tracker stories:
https://www.pivotaltracker.com/story/show/127384493
https://www.pivotaltracker.com/story/show/135369671

Comment 3 Nick Carboni 2017-03-23 16:04:58 UTC
For this we are investigating allowing the user to set a virtual IP address that will be taken over by the server which is currently the primary.

Users will then configure the servers in the local region, as well as the replication subscription, against the virtual IP rather than having to change the IP the "clients" point at (i.e. the Failover Monitor goes away).

To do this we need a way for users to set the virtual IP at database configuration time, which will also require a new way of storing the data (the schema will not be initialized until at least one app server has been brought up).

The pitch for this is to create a new table in the database which will be a simple key-value store. This table will store the VIP as well as the replication slot used by pglogical. It will not be managed by ActiveRecord migrations and will be created and populated using a newly developed mechanism.

We will then set the virtual_ip key to the VIP value when a user sets it up and set the replication_slot key to the name of the replication slot created for a subscription. Then when a failover occurs the new primary will pull these two values, promote itself, recreate the replication slot, then enable the VIP.

This will ensure that no clients can access the database server before it is ready to replicate any data changes, minimizing the chance of data loss in the global region.
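A minimal sketch of the proposed key-value store, assuming the pg gem; the table name, slot name, and addresses are illustrative, not the final schema:

  require "pg"

  conn = PG.connect(:dbname => "vmdb_production")

  # A simple key-value table, created outside of ActiveRecord migrations.
  conn.exec(<<~SQL)
    CREATE TABLE IF NOT EXISTS database_configuration (
      key   varchar PRIMARY KEY,
      value varchar NOT NULL
    )
  SQL

  # Store the VIP and the pglogical replication slot name as described above.
  conn.exec_params("INSERT INTO database_configuration (key, value) VALUES ($1, $2)",
                   %w[virtual_ip 192.0.2.10])
  conn.exec_params("INSERT INTO database_configuration (key, value) VALUES ($1, $2)",
                   %w[replication_slot pgl_vmdb_production_sub])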

pglogical in the global region will also need to be configured to retry failed connections. This can be accomplished either by creating a monitoring thread to detect when subscriptions go down or by tweaking the TCP connection parameters pglogical uses to connect to the remote region database (see https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html#GUC-TCP-KEEPALIVES-COUNT and https://github.com/2ndQuadrant/pglogical/commit/362035ef55edaadc0c4ee748061b78d63528131c).
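For the keepalive approach, a sketch of the extra libpq parameters on the subscription's connection string (the parameters are standard libpq settings documented at the first link above; the values are illustrative):

  # Detect a dead remote primary in bounded time instead of hanging on a
  # half-open TCP connection: after ~5s of idle, probe every second and give
  # up after 5 failed probes.
  conninfo = "host=192.0.2.10 dbname=vmdb_production user=root " \
             "keepalives=1 keepalives_idle=5 keepalives_interval=1 keepalives_count=5"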

The last piece of this effort will be creating a service that makes sure the IP is properly assigned (or not assigned) to a particular database server.

- One part of this is proper fencing. The utility we are planning to use for the VIP sends an ARP broadcast, so in my testing clients did connect to the new primary, but we don't really want to rely on that. Fencing is being tracked separately in https://www.pivotaltracker.com/story/show/130841015, but it may be required for this to work.

- The other part is determining if the server should assign itself the VIP on reboot. We can do two checks to determine this.
  1. Determine if the local postgres server is configured to run as a primary
  2. If so, query the other servers in the repmgr cluster; if none of them is configured as a primary and the repmgr repl_nodes table looks okay (the other nodes should be pointing to us as the upstream), then take the VIP

The plan is to develop this as a separate script that will run once at startup (probably as a systemd service).
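A rough sketch of that startup script's two checks, under stated assumptions: the other-node conninfo list and take_vip! are hypothetical, and a full implementation would also verify the repmgr repl_nodes table as described above:

  require "pg"

  # Conninfo strings for the other servers in the repmgr cluster (illustrative).
  OTHER_NODES = ["host=192.0.2.11 dbname=vmdb_production user=root",
                 "host=192.0.2.12 dbname=vmdb_production user=root"].freeze

  def in_recovery?(conninfo)
    conn = PG.connect(conninfo)
    conn.exec("SELECT pg_is_in_recovery()").getvalue(0, 0) == "t"
  ensure
    conn.close if conn
  end

  # Check 1: the local postgres server is configured to run as a primary.
  local_is_primary = !in_recovery?("dbname=vmdb_production")

  # Check 2: none of the other servers is acting as a primary.
  others_are_standbys = OTHER_NODES.all? { |node| in_recovery?(node) }

  take_vip! if local_is_primary && others_are_standbys  # take_vip! is hypothetical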

Comment 4 Brett Thurber 2017-04-03 03:56:47 UTC
For re-adding a node back into the cluster, we will want to ensure that the database is in sync, so something like pg_rewind will need to be initiated. This may be a consideration as part of the proposed node re-entry script. A vote may be needed. I believe this is the intent of: https://www.pivotaltracker.com/story/show/130841015
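A sketch of what the re-sync step of such a node re-entry script might look like; pg_rewind's flags are real, but the data directory, service name, and conninfo are illustrative, and pg_rewind requires wal_log_hints=on or data checksums on the cluster:

  pgdata = "/var/lib/pgsql/data"                               # illustrative data dir
  source = "host=192.0.2.10 dbname=vmdb_production user=root"  # current primary

  # The target server must be stopped cleanly before it can be rewound.
  system("systemctl", "stop", "postgresql") or abort("failed to stop postgres")
  system("pg_rewind", "--target-pgdata=#{pgdata}",
         "--source-server=#{source}") or abort("pg_rewind failed")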

Ack on the other items as this tends to match what traditional clustering relies on.

Comment 9 Nick Carboni 2018-02-20 14:19:52 UTC
*** Bug 1546902 has been marked as a duplicate of this bug. ***

Comment 13 CFME Bot 2018-09-07 20:46:09 UTC
New commits detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/f9c3f5e722588e22ce17a4710e5328abdff851cc
commit f9c3f5e722588e22ce17a4710e5328abdff851cc
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:53:02 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:53:02 2018 -0400

    Add the manageiq-postgres_ha_admin gem to the Gemfile

    We need this here now that we are using the gem from inside the app

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 Gemfile | 1 +
 1 file changed, 1 insertion(+)


https://github.com/ManageIQ/manageiq/commit/78562b0719b0424852b42a7877ca58b39c2707ac
commit 78562b0719b0424852b42a7877ca58b39c2707ac
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:51:54 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:51:54 2018 -0400

    Configure and start the failover monitor from EvmDatabase

    We need the PglogicalSubscription model to properly remove, add, and
    query for subscriptions. Making this all a method in the rails
    environment simplifies that process.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 52 +
 1 file changed, 52 insertions(+)


https://github.com/ManageIQ/manageiq/commit/0b24d4252d6aef3f7dcf93705b336e297a317ed2
commit 0b24d4252d6aef3f7dcf93705b336e297a317ed2
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:31:39 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:31:39 2018 -0400

    Only allow the database_operations server to monitor subscription failover

    Since the subscriptions in the global region are a region-wide object
    (rather than database.yml which exists in each server) only one
    server should be responsible for monitoring them.

    The database_operations role seemed a good candidate

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 2 +
 1 file changed, 2 insertions(+)
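In effect, the monitor's startup gates the subscription watching on the server's roles, roughly as below (has_active_role? is an existing MiqServer predicate; the watcher variables are illustrative):

  # Every server watches its own database.yml; only the server holding the
  # region-wide database_operations role also watches the subscriptions.
  watch_subscriptions = MiqServer.my_server.has_active_role?("database_operations")

  watchers  = [database_yml_watcher]
  watchers += subscription_watchers if watch_subscriptions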


https://github.com/ManageIQ/manageiq/commit/0688eaf7fcb982bf71e12593baa515e4b0126425
commit 0688eaf7fcb982bf71e12593baa515e4b0126425
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:31:14 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:31:14 2018 -0400

    Add .restart_failover_monitor_service to EvmDatabase class

    This will need to be done whenever the roles of a server change
    or when subscriptions are changed (and possibly in more places as
    well). It will be better to have it in one place.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 4 +
 1 file changed, 4 insertions(+)


https://github.com/ManageIQ/manageiq/commit/f277e419d6b86ca6dff1c876d97c66f9c5d6dd1d
commit f277e419d6b86ca6dff1c876d97c66f9c5d6dd1d
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:34:09 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:34:09 2018 -0400

    Create the MiqServer#role_changes method

    This will allow us to determine exactly how each role will be
    changed during the role sync process.

    We use this information to only restart the failover monitor
    when the database operations role is being either added or removed.
    The failover monitor startup process will then get a chance
    to decide whether or not it should be monitoring subscriptions as well
    as database.yml

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 app/models/miq_server/role_management.rb | 14 +-
 app/models/miq_server/worker_management/monitor.rb | 4 +
 2 files changed, 14 insertions(+), 4 deletions(-)
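A sketch of how the new method might feed the restart decision (assuming role_changes returns the lists of added and removed roles, per the description above):

  added, removed = MiqServer.my_server.role_changes

  # Restart the monitor only when database_operations is being added or
  # removed; its startup logic then re-decides what to watch.
  if (added + removed).include?("database_operations")
    EvmDatabase.restart_failover_monitor_service
  end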


https://github.com/ManageIQ/manageiq/commit/9e5063c3e32873f15d8528e1420206accbeb0afe
commit 9e5063c3e32873f15d8528e1420206accbeb0afe
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:35:35 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:35:35 2018 -0400

    Restart the failover monitor when subscriptions are added or deleted

    This will allow the monitor to pick up new subscriptions as they are
    added and stop monitoring old ones when they are removed.

    These also need to be queued because the server which is handling the
    subscription change may not have the database_operations role active

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 app/models/pglogical_subscription.rb | 15 +-
 lib/evm_database.rb | 9 +
 spec/models/pglogical_subscription_spec.rb | 36 +
 3 files changed, 56 insertions(+), 4 deletions(-)
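Queueing with a :role option is the standard ManageIQ way to run work on whichever server has that role active; a sketch of the restart being queued from the subscription save/delete path (exact arguments illustrative):

  MiqQueue.put(
    :class_name  => "EvmDatabase",
    :method_name => "restart_failover_monitor_service",
    :role        => "database_operations"  # executes on the server holding this role
  )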


https://github.com/ManageIQ/manageiq/commit/eb8d478af5a42f32f2afbac8c6f80bb2533abfff
commit eb8d478af5a42f32f2afbac8c6f80bb2533abfff
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 18:10:22 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 18:10:22 2018 -0400

    Add specs for EvmDatabase.run_failover_monitor

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 4 +-
 spec/lib/evm_database_spec.rb | 59 +
 2 files changed, 61 insertions(+), 2 deletions(-)


https://github.com/ManageIQ/manageiq/commit/48f39a04a17a36382fa947b65335cd642991a46e
commit 48f39a04a17a36382fa947b65335cd642991a46e
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Sep  5 10:20:13 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Wed Sep  5 10:20:13 2018 -0400

    Only restart the failover service if it was running previously

    Before this change, users would manually start the failover
    monitor service through the appliance console.
    This will keep the user experience the same as it was by not
    starting the service if the user didn't start it previously.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 3 +-
 spec/lib/evm_database_spec.rb | 20 +
 2 files changed, 22 insertions(+), 1 deletion(-)
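The guard amounts to checking the service state first, e.g. (assuming LinuxAdmin::Service, which the appliance already uses for service management):

  service = LinuxAdmin::Service.new("evm-failover-monitor")
  service.restart if service.running?  # stays stopped if the user never started it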

Comment 14 CFME Bot 2018-09-10 15:52:03 UTC
New commits detected on ManageIQ/manageiq-appliance/master:

https://github.com/ManageIQ/manageiq-appliance/commit/3fe8e1ab9cfede9afe59ec9b0ed303d6ff8ac727
commit 3fe8e1ab9cfede9afe59ec9b0ed303d6ff8ac727
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:41:48 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:41:48 2018 -0400

    Remove manageiq-postgres_ha_admin from the appliance dependencies

    We use the gem from manageiq's core repo now, so we will move
    the dependency there.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 manageiq-appliance-dependencies.rb | 1 -
 1 file changed, 1 deletion(-)


https://github.com/ManageIQ/manageiq-appliance/commit/d2c52fbd4b9776cbb6eeaa0b42a339e21c9e2289
commit d2c52fbd4b9776cbb6eeaa0b42a339e21c9e2289
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:51:00 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:51:00 2018 -0400

    Use the rails runner to start the failover monitor

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 COPY/usr/lib/systemd/system/evm-failover-monitor.service | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
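That change amounts to pointing the unit's ExecStart at rails runner so the monitor boots inside the full Rails environment; an illustrative sketch of the resulting line (EvmDatabase.run_failover_monitor is the entry point named in comment 13):

  [Service]
  ExecStart=/var/www/miq/vmdb/bin/rails runner 'EvmDatabase.run_failover_monitor'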

Comment 15 Tasos Papaioannou 2018-09-17 19:44:02 UTC
Verified on 5.10.0.15.

Comment 17 errata-xmlrpc 2019-02-07 23:02:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212

