Bug 1445841 - Reintroducing a standby node that has already been reintroduced causes failure
Summary: Reintroducing a standby node that has already been reintroduced causes failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Nick Carboni
QA Contact: luke couzens
URL:
Whiteboard: HA:black
Depends On:
Blocks: 1446304 1446305
 
Reported: 2017-04-26 15:42 UTC by luke couzens
Modified: 2018-03-06 15:38 UTC
CC List: 7 users

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1446304 1446305
Environment:
Last Closed: 2018-03-06 15:38:10 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description luke couzens 2017-04-26 15:42:17 UTC
Description of problem: Reintroducing a standby node that has already been reintroduced causes failure


Version-Release number of selected component (if applicable):5.8.0.12


How reproducible:100%


Steps to Reproduce:
1. Set up appliances in an HA configuration.
2. Connect to the standby node, re-run 'Configure Database Replication', and configure it as a standby node with the same cluster ID.

Actual results:
Appliance database found under: /var/opt/rh/rh-postgresql95/lib/pgsql/data
Replication standby server can not be configured if the database already exists
Would you like to remove the existing database before configuring as a standby
server?
  WARNING: This is destructive. This will remove all previous data from this
server
Continue? (Y/N): Y
No partition found for Standby database disk. You probably want to add an
unpartitioned disk and try again.
Are you sure you don't want to partition the Standby database disk? (Y/N): y
Enter the number uniquely identifying this node in the replication cluster: 1
Enter the cluster database name: |vmdb_production| 
Enter the cluster database username: |root| 
Enter the cluster database password: *******
Enter the cluster database password: *******
Enter the primary database hostname or IP address: *******
Enter the Standby Server hostname or IP address: |*********| *********
Configure Replication Manager (repmgrd) for automatic failover? (Y/N): y
An active standby node (**********) with the node number 1 already exists
Would you like to continue configuration by overwriting the existing node?
(Y/N): |N| y
Warning: File /etc/repmgr.conf exists. Replication is already configured
Continue with configuration? (Y/N): y

Replication Server Configuration

        Cluster Node Number:        1
        Cluster Database Name:      vmdb_production
        Cluster Database User:      root
        Cluster Database Password:  "********"
        Cluster Primary Host:       ********
        Standby Host:               ********
        Automatic Failover:         enabled
Apply this Replication Server Configuration? (Y/N): y
Configuring Replication Standby Server...
[2017-04-26 10:45:34] [NOTICE] Redirecting logging output to
'/var/log/repmgr/repmgrd.log'
Failed to configure replication server
/opt/rh/cfme-gemset/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!': repmgr exit code: 7 (AwesomeSpawn::CommandResultError)
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:122:in `block in run_repmgr_command'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:119:in `fork'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication.rb:119:in `run_repmgr_command'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication_standby.rb:97:in `clone_standby_server'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console/database_replication_standby.rb:69:in `activate'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:555:in `block in <module:ApplianceConsole>'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:108:in `loop'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:108:in `<module:ApplianceConsole>'
	from /opt/rh/cfme-gemset/bundler/gems/manageiq-gems-pending-12fcafd08821/lib/gems/pending/appliance_console.rb:99:in `<top (required)>'
	from /usr/bin/appliance_console:10:in `require'
	from /usr/bin/appliance_console:10:in `<main>'
Database Replication not configured

Press any key to continue.

Expected results: The node is reconfigured as a standby


Additional info:
logs
[2017-04-26 11:18:37] [ERROR] unable to reconnect to master (timeout 60 seconds)...
[2017-04-26 11:19:30] [NOTICE] destination directory '/var/opt/rh/rh-postgresql95/lib/pgsql/data' provided
[2017-04-26 11:19:30] [NOTICE] starting backup (using pg_basebackup)...
[2017-04-26 11:19:30] [HINT] this may take some time; consider using the -c/--fast-checkpoint option
[2017-04-26 11:19:39] [NOTICE] standby clone (using pg_basebackup) complete
[2017-04-26 11:19:39] [NOTICE] you can now start your PostgreSQL server
[2017-04-26 11:19:39] [HINT] for example : pg_ctl -D /var/opt/rh/rh-postgresql95/lib/pgsql/data start
[2017-04-26 11:19:39] [HINT] After starting the server, you need to register this standby with "repmgr standby register"
[2017-04-26 11:19:43] [NOTICE] standby node correctly registered for cluster miq_region_1_cluster with id 1 (conninfo: host=****** user=root dbname=vmdb_production)
[2017-04-26 11:19:43] [WARNING] Unable to create event record: ERROR:  cannot execute INSERT in a read-only transaction

[2017-04-26 11:21:39] [NOTICE] destination directory '/var/opt/rh/rh-postgresql95/lib/pgsql/data' provided
[2017-04-26 11:21:39] [ERROR] Slot 'repmgr_slot_1' already exists as an active slot
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2017-04-26 11:21:41] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused

[2017-04-26 11:21:41] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
[2017-04-26 11:21:51] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
[2017-04-26 11:22:01] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
[2017-04-26 11:22:11] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
[2017-04-26 11:22:21] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
[2017-04-26 11:22:31] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
[2017-04-26 11:22:41] [ERROR] unable to reconnect to master (timeout 60 seconds)...

Comment 2 Nick Carboni 2017-04-26 16:04:41 UTC
This is failing because the replication slot on the primary is still being used by the PG instance that we are trying to overwrite.

Specifically repmgr is failing in this block (https://github.com/2ndQuadrant/repmgr/blob/3802b917e030d239b04d29a4885681d6df63ddb7/dbutils.c#L954-L982)

A fix for this is to ensure that postgres and repmgrd are stopped locally before trying to reconfigure the standby server.
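
As a rough sketch (not the actual patch), that could be done with the LinuxAdmin gem, which the appliance console already uses. The service names here are assumptions based on the rh-postgresql95 paths in the report:

require 'linux_admin'

# Hypothetical helper; service names assumed from the rh-postgresql95 SCL
# paths shown in the console output above.
def stop_replication_services
  # Stop repmgrd first so it does not react to the database going away,
  # then stop PostgreSQL so its connection releases the replication slot
  # held on the primary.
  LinuxAdmin::Service.new("rh-postgresql95-repmgr").stop
  LinuxAdmin::Service.new("rh-postgresql95-postgresql").stop
end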

Without this change you are left with a replication slot on the primary DB server and no database at all on the standby (empty data directory) after the configuration fails. This is pretty much the worst-case scenario, as the disk on the primary will fill with WAL because of the unused slot, and you have no standby.
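
If you end up in that state and are not going to reconfigure the standby right away, the orphaned slot can be dropped on the primary so WAL stops accumulating. A hedged example using the pg gem (connection details are placeholders; pg_replication_slots and pg_drop_replication_slot are standard PostgreSQL):

require 'pg'

conn = PG.connect(host: "primary.example.com", dbname: "vmdb_production",
                  user: "root", password: "secret")

# An inactive slot pins WAL on the primary until it is dropped.
conn.exec("SELECT slot_name, active FROM pg_replication_slots").each do |row|
  if row["slot_name"] == "repmgr_slot_1" && row["active"] == "f"
    conn.exec("SELECT pg_drop_replication_slot('repmgr_slot_1')")
  end
end
conn.close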

On the plus side, if you take the same steps again the configuration will succeed, since the replication slot is no longer marked as active once the standby database is no longer running (we deleted the contents of the data directory). I'll go ahead and request blocker for this anyway ...

Comment 3 Nick Carboni 2017-04-26 18:02:09 UTC
Changing the severity because of the bad situation this error puts you in, as described in comment 2:

>you are left with a replication slot on the primary DB server and no database at all on the standby (empty data directory) after the configuration fails. This is pretty much the worst case scenario as the disk will fill with WAL on the primary because of the unused slot and you have no standby.

Comment 5 CFME Bot 2017-04-27 14:57:51 UTC
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/be284fee6c0e861889556378f91855f954c44c4c

commit be284fee6c0e861889556378f91855f954c44c4c
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Apr 26 13:55:01 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Wed Apr 26 13:55:01 2017 -0400

    Stop the postgres and repmgr services before configuring a standby
    
    This ensures that we will not error out in the "reconfigure" case.
    
    In this case the standby server is configured successfully and we
    go through the process of configuring a standby server again. Before
    this change we would fail because the running PG server would be
    using the replication slot that we wanted to replace.
    
    This error would leave the primary server with a replication slot
    and would remove the data from the standby server.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1445841

 .../database_replication_standby.rb                | 12 ++++++++++++
 .../database_replication_standby_spec.rb           | 22 ++++++++++++++++++++++
 2 files changed, 34 insertions(+)
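
The diff itself is not reproduced here, but based on the commit message the change would look roughly like this in database_replication_standby.rb (a hypothetical shape only; the stop_* method names are illustrative and the service names are assumptions, as in comment 2):

def activate
  stop_postgres   # release the data directory and the slot held on the primary
  stop_repmgrd    # keep repmgrd from reacting while we reconfigure

  # ... existing flow continues: disk setup, clone_standby_server,
  # standby registration, repmgrd configuration ...
end

def stop_postgres
  LinuxAdmin::Service.new("rh-postgresql95-postgresql").stop
end

def stop_repmgrd
  LinuxAdmin::Service.new("rh-postgresql95-repmgr").stop
end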

Comment 8 luke couzens 2017-10-12 12:03:38 UTC
Verified in 5.9.0.2

