Description of problem:
When replication from an HA region is configured against the first primary DB, replication stops working after a fail-over event because the subscription still points to the old DB, which may be offline or no longer the active primary.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a region with HA
2. Set up a replication subscription to the above HA region
3. Cause a failover in the above HA region
4. After the failover is complete, create a new zone in the above HA region
5. Check for the new zone in the global region

Actual results:
Replication still points to the failed DB, so new data is not replicated.

Expected results:
Replication should be HA aware and automatically update when in failover mode.

Additional info:
Investigating solutions to this using Tracker stories:
https://www.pivotaltracker.com/story/show/127384493
https://www.pivotaltracker.com/story/show/135369671
For this we are investigating allowing the user to set a virtual IP address (VIP) that will be taken over by whichever server is currently the primary. Users will then configure the servers in the local region, as well as the replication subscription, against the virtual IP rather than having to change the IP the "clients" are pointed at (i.e. the Failover Monitor goes away).

To do this we need a way for users to set the virtual IP at database configuration time, which will also require a new way of storing the data (as the schema will not be initialized before at least one app server is brought up).

The pitch for this is to create a new table in the database which will be a simple key-value store (see the sketch at the end of this comment). This table will store the VIP as well as the replication slot used by pglogical. The table will not be managed by ActiveRecord migrations and will be created and populated using a newly developed mechanism. We will set the virtual_ip key to the VIP value when a user configures it, and set the replication_slot key to the name of the replication slot created for a subscription.

When a failover occurs, the new primary will pull these two values, promote itself, recreate the replication slot, then enable the VIP. This ensures that no clients can access the database server before it is ready to replicate data changes, minimizing the chance of data loss in the global region.

pglogical in the global region will also need to be configured to retry failed connections. This can be accomplished either by creating a monitoring thread to detect when subscriptions go down, or by tweaking the TCP connection parameters pglogical uses to connect to the remote region database (see https://www.postgresql.org/docs/9.5/static/runtime-config-connection.html#GUC-TCP-KEEPALIVES-COUNT and https://github.com/2ndQuadrant/pglogical/commit/362035ef55edaadc0c4ee748061b78d63528131c).

The last piece of this effort will be creating a service that makes sure the IP is properly assigned (or not assigned) to a particular database server.

- One part of this is proper fencing. The utility we are planning to use for the VIP sends an ARP broadcast, so in my testing the clients connect to the new primary, but we don't really want to rely on that. Fencing is being tracked separately in https://www.pivotaltracker.com/story/show/130841015, but it may be required for this to work.

- The other part is determining whether the server should assign itself the VIP on reboot. We can do two checks to determine this:
  1. Determine if the local postgres server is configured to run as a primary.
  2. If so, query the other servers in the repmgr cluster. If none of the other servers are configured as a primary and the repmgr repl_nodes table looks okay (the other nodes should be pointing to us as the upstream), then take the VIP.

The plan is to develop this as a separate script that will run once at startup (probably as a systemd service).
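To make the key-value store and failover-time lookup concrete, here is a minimal sketch using the pg gem. The table name (ha_settings), column and key names, slot name, and connection details are illustrative assumptions, not the final mechanism described above.

  require "pg"

  # Connect to the region database directly; at database configuration time
  # the Rails schema may not exist yet, so we do not go through ActiveRecord.
  conn = PG.connect(dbname: "vmdb_production")

  # Create the simple key-value store outside of ActiveRecord migrations.
  conn.exec(<<~SQL)
    CREATE TABLE IF NOT EXISTS ha_settings (
      name  VARCHAR PRIMARY KEY,
      value VARCHAR NOT NULL
    )
  SQL

  # Record the VIP when the user configures it and the replication slot name
  # when a subscription is created.
  upsert = <<~SQL
    INSERT INTO ha_settings (name, value) VALUES ($1, $2)
    ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value
  SQL
  conn.exec_params(upsert, ["virtual_ip", "192.0.2.10"])
  conn.exec_params(upsert, ["replication_slot", "region_1_subscription"])

  # On failover, the new primary reads these values back before promoting
  # itself, recreating the replication slot, and finally taking the VIP.
  settings = conn.exec("SELECT name, value FROM ha_settings")
                 .each_with_object({}) { |row, h| h[row["name"]] = row["value"] }
  settings["virtual_ip"]       # => "192.0.2.10"
  settings["replication_slot"] # => "region_1_subscription"

For the retry behavior on the global side, one option is to include libpq TCP keepalive settings (keepalives_idle, keepalives_interval, keepalives_count, per the PostgreSQL link above) in the subscription's DSN so that connections to a failed primary are detected as dead and re-established rather than hanging indefinitely.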
For re-adding a node back into the cluster, we will want to ensure that the database is in sync, so something like pg_rewind will need to be initiated. This may be a consideration as part of the proposed node re-entry script (see the sketch below). A vote may be needed; I believe this is the intent of https://www.pivotaltracker.com/story/show/130841015. Ack on the other items, as this tends to match what traditional clustering relies on.
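As a rough sketch of what the re-sync step of such a re-entry script might look like (run with appropriate privileges; the service name, data directory, and source connection string below are illustrative assumptions, not the final implementation):

  # Re-sync a failed former primary with pg_rewind before rejoining the
  # repmgr cluster as a standby. Requires the local server to be stopped and
  # wal_log_hints (or data checksums) to be enabled on both nodes.
  data_dir      = "/var/opt/rh/rh-postgresql95/lib/pgsql/data"
  source_server = "host=new-primary.example.com port=5432 user=postgres dbname=postgres"

  system("systemctl", "stop", "rh-postgresql95-postgresql")

  ok = system("pg_rewind",
              "--target-pgdata=#{data_dir}",
              "--source-server=#{source_server}")

  # If pg_rewind cannot bring the node back in line, fall back to a fresh
  # base backup before re-registering the node with repmgr.
  abort("pg_rewind failed; take a new base backup instead") unless ok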
*** Bug 1546902 has been marked as a duplicate of this bug. ***
https://github.com/ManageIQ/manageiq/pull/17837
https://github.com/ManageIQ/manageiq-appliance/pull/202
New commits detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/f9c3f5e722588e22ce17a4710e5328abdff851cc
commit f9c3f5e722588e22ce17a4710e5328abdff851cc
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:53:02 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:53:02 2018 -0400

    Add the manageiq-postgres_ha_admin gem to the Gemfile

    We need this here now that we are using the gem from inside the app

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 Gemfile | 1 +
 1 file changed, 1 insertion(+)

https://github.com/ManageIQ/manageiq/commit/78562b0719b0424852b42a7877ca58b39c2707ac
commit 78562b0719b0424852b42a7877ca58b39c2707ac
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:51:54 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:51:54 2018 -0400

    Configure and start the failover monitor from EvmDatabase

    We need the PglogicalSubscription model to properly remove, add, and
    query for subscriptions. Making this all a method in the rails
    environment simplifies that process.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 52 +
 1 file changed, 52 insertions(+)

https://github.com/ManageIQ/manageiq/commit/0b24d4252d6aef3f7dcf93705b336e297a317ed2
commit 0b24d4252d6aef3f7dcf93705b336e297a317ed2
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:31:39 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:31:39 2018 -0400

    Only allow the database_operations server to monitor subscription failover

    Since the subscriptions in the global region are a region-wide object
    (rather than database.yml which exists in each server) only one server
    should be responsible for monitoring them. The database_operations role
    seemed a good candidate

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 2 +
 1 file changed, 2 insertions(+)

https://github.com/ManageIQ/manageiq/commit/0688eaf7fcb982bf71e12593baa515e4b0126425
commit 0688eaf7fcb982bf71e12593baa515e4b0126425
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:31:14 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:31:14 2018 -0400

    Add .restart_failover_monitor_service to EvmDatabase class

    This will need to be done whenever the roles of a server change or when
    subscriptions are changed (and possibly in more places as well). It will
    be better to have it in one place.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb | 4 +
 1 file changed, 4 insertions(+)

https://github.com/ManageIQ/manageiq/commit/f277e419d6b86ca6dff1c876d97c66f9c5d6dd1d
commit f277e419d6b86ca6dff1c876d97c66f9c5d6dd1d
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:34:09 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:34:09 2018 -0400

    Create the MiqServer#role_changes method

    This will allow us to determine exactly how each role will be changed
    during the role sync process. We use this information to only restart
    the failover monitor when the database operations role is being either
    added or removed.

    The failover monitor startup process will then get a chance to decide
    whether or not it should be monitoring subscriptions as well as
    database.yml

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 app/models/miq_server/role_management.rb           | 14 +-
 app/models/miq_server/worker_management/monitor.rb |  4 +
 2 files changed, 14 insertions(+), 4 deletions(-)

https://github.com/ManageIQ/manageiq/commit/9e5063c3e32873f15d8528e1420206accbeb0afe
commit 9e5063c3e32873f15d8528e1420206accbeb0afe
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 14:35:35 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 14:35:35 2018 -0400

    Restart the failover monitor when subscriptions are added or deleted

    This will allow the monitor to pick up new subscriptions as they are
    added and stop monitoring old ones when they are removed. These also
    need to be queued because the server which is handling the subscription
    change may not have the database_operations role active

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 app/models/pglogical_subscription.rb       | 15 +-
 lib/evm_database.rb                        |  9 +
 spec/models/pglogical_subscription_spec.rb | 36 +
 3 files changed, 56 insertions(+), 4 deletions(-)

https://github.com/ManageIQ/manageiq/commit/eb8d478af5a42f32f2afbac8c6f80bb2533abfff
commit eb8d478af5a42f32f2afbac8c6f80bb2533abfff
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 24 18:10:22 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 24 18:10:22 2018 -0400

    Add specs for EvmDatabase.run_failover_monitor

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb           |  4 +-
 spec/lib/evm_database_spec.rb | 59 +
 2 files changed, 61 insertions(+), 2 deletions(-)

https://github.com/ManageIQ/manageiq/commit/48f39a04a17a36382fa947b65335cd642991a46e
commit 48f39a04a17a36382fa947b65335cd642991a46e
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Sep 5 10:20:13 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Wed Sep 5 10:20:13 2018 -0400

    Only restart the failover service if it was running previously

    Before this change, users would manually start the failover monitor
    service through the appliance console. This will keep the user
    experience the same as it was by not starting the service if the user
    didn't start it previously.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 lib/evm_database.rb           |  3 +-
 spec/lib/evm_database_spec.rb | 20 +
 2 files changed, 22 insertions(+), 1 deletion(-)
New commits detected on ManageIQ/manageiq-appliance/master:

https://github.com/ManageIQ/manageiq-appliance/commit/3fe8e1ab9cfede9afe59ec9b0ed303d6ff8ac727
commit 3fe8e1ab9cfede9afe59ec9b0ed303d6ff8ac727
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:41:48 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:41:48 2018 -0400

    Remove manageiq-postgres_ha_admin from the appliance dependencies

    We use the gem from manageiq's core repo now, so we will move the
    dependency there.

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 manageiq-appliance-dependencies.rb | 1 -
 1 file changed, 1 deletion(-)

https://github.com/ManageIQ/manageiq-appliance/commit/d2c52fbd4b9776cbb6eeaa0b42a339e21c9e2289
commit d2c52fbd4b9776cbb6eeaa0b42a339e21c9e2289
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Aug 10 11:51:00 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Aug 10 11:51:00 2018 -0400

    Use the rails runner to start the failover monitor

    https://bugzilla.redhat.com/show_bug.cgi?id=1391095

 COPY/usr/lib/systemd/system/evm-failover-monitor.service | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Verified on 5.10.0.15.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0212