Description of problem:
Reintroducing a standby node that has already been reintroduced causes a failure.

Version-Release number of selected component (if applicable):
5.8.0.12

How reproducible:
100%

Steps to Reproduce:
1. Set up appliances in an HA configuration
2. Connect to the standby node and re-run 'Configure Database Replication', setting it up as a standby node with the same cluster ID

Actual results:

Appliance database found under: /var/opt/rh/rh-postgresql95/lib/pgsql/data
Replication standby server can not be configured if the database already exists
Would you like to remove the existing database before configuring as a standby server?
WARNING: This is destructive. This will remove all previous data from this server
Continue? (Y/N): Y
No partition found for Standby database disk. You probably want to add an unpartitioned disk and try again.
Are you sure you don't want to partition the Standby database disk? (Y/N): y
Enter the number uniquely identifying this node in the replication cluster: 1
Enter the cluster database name: |vmdb_production|
Enter the cluster database username: |root|
Enter the cluster database password: *******
Enter the cluster database password: *******
Enter the primary database hostname or IP address: *******
Enter the Standby Server hostname or IP address: |*********| *********
Configure Replication Manager (repmgrd) for automatic failover? (Y/N): y
An active standby node (**********) with the node number 1 already exists
Would you like to continue configuration by overwriting the existing node? (Y/N): |N| y
Warning: File /etc/repmgr.conf exists. Replication is already configured
Continue with configuration? (Y/N): y

Replication Server Configuration

Cluster Node Number:       1
Cluster Database Name:     vmdb_production
Cluster Database User:     root
Cluster Database Password: "********"
Cluster Primary Host:      ********
Standby Host:              ********
Automatic Failover:        enabled

Apply this Replication Server Configuration? (Y/N): y

Configuring Replication Standby Server...
[2017-04-26 10:45:34] [NOTICE] Redirecting logging output to '/var/log/repmgr/repmgrd.log'
Failed to configure replication server
/opt/rh/cfme-gemset/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!': repmgr exit code: 7 (AwesomeSpawn::CommandResultError)
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:122:in `block in run_repmgr_command'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:119:in `fork'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:119:in `run_repmgr_command'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication_standby.rb:97:in `clone_standby_server'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication_standby.rb:69:in `activate'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:555:in `block in <module:ApplianceConsole>'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:108:in `loop'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:108:in `<module:ApplianceConsole>'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:99:in `<top (required)>'
	from /usr/bin/appliance_console:10:in `require'
	from /usr/bin/appliance_console:10:in `<main>'
Database Replication not configured
Press any key to continue.

Expected results:
Node is reconfigured as standby

Additional info:
Logs:

[2017-04-26 11:18:37] [ERROR] unable to reconnect to master (timeout 60 seconds)...
[2017-04-26 11:19:30] [NOTICE] destination directory '/var/opt/rh/rh-postgresql95/lib/pgsql/data' provided
[2017-04-26 11:19:30] [NOTICE] starting backup (using pg_basebackup)...
[2017-04-26 11:19:30] [HINT] this may take some time; consider using the -c/--fast-checkpoint option
[2017-04-26 11:19:39] [NOTICE] standby clone (using pg_basebackup) complete
[2017-04-26 11:19:39] [NOTICE] you can now start your PostgreSQL server
[2017-04-26 11:19:39] [HINT] for example : pg_ctl -D /var/opt/rh/rh-postgresql95/lib/pgsql/data start
[2017-04-26 11:19:39] [HINT] After starting the server, you need to register this standby with "repmgr standby register"
[2017-04-26 11:19:43] [NOTICE] standby node correctly registered for cluster miq_region_1_cluster with id 1 (conninfo: host=****** user=root dbname=vmdb_production)
[2017-04-26 11:19:43] [WARNING] Unable to create event record: ERROR: cannot execute INSERT in a read-only transaction
[2017-04-26 11:21:39] [NOTICE] destination directory '/var/opt/rh/rh-postgresql95/lib/pgsql/data' provided
[2017-04-26 11:21:39] [ERROR] Slot 'repmgr_slot_1' already exists as an active slot
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
[2017-04-26 11:21:41] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2017-04-26 11:21:41] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
[2017-04-26 11:21:51] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
[2017-04-26 11:22:01] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
[2017-04-26 11:22:11] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
[2017-04-26 11:22:21] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
[2017-04-26 11:22:31] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
[2017-04-26 11:22:41] [ERROR] unable to reconnect to master (timeout 60 seconds)...
This is failing because the replication slot on the primary is still being used by the PG instance that we are trying to overwrite. Specifically, repmgr fails in this block: https://github.com/2ndQuadrant/repmgr/blob/3802b917e030d239b04d29a4885681d6df63ddb7/dbutils.c#L954-L982

A fix for this is to ensure that postgres and repmgrd are stopped locally before trying to reconfigure the standby server.

Without this change, after the configuration fails you are left with a replication slot on the primary DB server and no database at all on the standby (empty data directory). This is pretty much the worst case scenario: the disk on the primary will fill with WAL because of the unused slot, and you have no standby.

On the plus side, if you take the same steps again the configuration will succeed, because the replication slot is no longer marked as active once the standby database is no longer running (we deleted the contents of the data directory). I'll go ahead and request blocker for this anyway.
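To make the manual equivalent concrete, here is a minimal sketch of stopping both services on the standby before re-running 'Configure Database Replication'. This is an illustration, not the actual patch; the systemd unit names are an assumption based on the rh-postgresql95 software collection used on the appliance and should be verified before use.

  # Illustration only -- unit names assumed, check them on the appliance first
  systemctl stop rh-postgresql95-repmgr        # stop repmgrd so it is not holding connections to the primary
  systemctl stop rh-postgresql95-postgresql    # stop the local PG server so the primary's replication slot goes inactive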
Changing the severity because of the bad situation this error puts you in, as described in comment 2:

> you are left with a replication slot on the primary DB server and no database at all on the standby (empty data directory) after the configuration fails. This is pretty much the worst case scenario as the disk will fill with WAL on the primary because of the unused slot and you have no standby.
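For anyone already left in that state, the leftover slot on the primary can be inspected with standard PostgreSQL catalog queries; this is a hedged example for cleanup, not part of the fix. The slot name repmgr_slot_1 and database name vmdb_production are taken from the logs above; run psql from the rh-postgresql95 collection as the database superuser.

  # List replication slots and whether each is currently in use
  psql -d vmdb_production -c "SELECT slot_name, active FROM pg_replication_slots;"
  # Only if the old standby is really gone and the slot shows active = f,
  # dropping it stops the primary from retaining WAL for that slot
  psql -d vmdb_production -c "SELECT pg_drop_replication_slot('repmgr_slot_1');"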
https://github.com/ManageIQ/manageiq-gems-pending/pull/142
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/be284fee6c0e861889556378f91855f954c44c4c

commit be284fee6c0e861889556378f91855f954c44c4c
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Apr 26 13:55:01 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Wed Apr 26 13:55:01 2017 -0400

    Stop the postgres and repmgr services before configuring a standby

    This ensures that we will not error out in the "reconfigure" case.
    In this case the standby server is configured successfully and we go
    through the process of configuring a standby server again.

    Before this change we would fail because the running PG server would be
    using the replication slot that we wanted to replace. This error would
    leave the primary server with a replication slot and would remove the
    data from the standby server.

    https://bugzilla.redhat.com/show_bug.cgi?id=1445841

 .../database_replication_standby.rb      | 12 ++++++++++++
 .../database_replication_standby_spec.rb | 22 ++++++++++++++++++++++
 2 files changed, 34 insertions(+)
Verified in 5.9.0.2