Bug 1419420
Summary: | How to delete standby db server in CFME 4.2 HA env | |
---|---|---|---
Product: | Red Hat CloudForms Management Engine | Reporter: | tachoi
Component: | Replication | Assignee: | Nick Carboni <ncarboni>
Status: | CLOSED NOTABUG | QA Contact: | Dave Johnson <dajohnso>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 5.7.0 | CC: | dshevrin, gtanzill, jhardy, ncarboni, obarenbo, tachoi
Target Milestone: | GA | |
Target Release: | cfme-future | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-02-14 02:01:01 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
tachoi
2017-02-06 06:13:26 UTC
Taeho, just to recap:

- Created a two-node cluster (node 2 as primary and node 1 as standby)
- Removed node 1 (node 2 is still primary)
- Deployed a *new* appliance and attempted to add it as node 1

At this point everything behaved as I would expect, since a failed node stays in the repl_nodes table for record-keeping purposes. A failed node in this table does not affect quorum calculations and is ignored for failover purposes.

What matters more is that the replication slot is removed. If it is not removed, WAL will accumulate on the primary. Your steps for removing the node and the replication slot were correct, assuming the standby node is truly *gone*. If the standby node is still accessible, you can run `repmgr standby unregister -f /etc/repmgr.conf` on the standby to cleanly remove it from the repmgr configuration, but you will still have to remove the replication slot manually (see the sketch at the end of this comment). It is also important to note that "standby unregister" will not stop the standby node, so replication will continue (i.e. the slot will stay active).

When you added a new node with the same id manually, failover didn't work because repmgrd (the service that coordinates automatic failover) would not have been running on the new node. You can check the status of this service with `systemctl status rh-postgresql95-repmgr`.

It sounds like we have a handful of RFEs out of this, but nothing I would consider "broken":

- Create an option for removing a standby cleanly
  - This option will take care of removing all the data from the repmgr tables as well as handling the replication slot.
  - This option will be usable only from the primary database, and the user will provide the connection information for the desired standby. (Further improvements can be made to parse the conninfo for the primary or other standby nodes from the repl_nodes table.)
- Create a separate option for controlling the repmgrd service
  - If the service is not started when creating the standby, none of the configuration files necessary to start the service are present, so it is not trivial to get it up and running.

Given all of this, I would like to create two separate RFEs for these issues and close this one if your question is answered, Taeho (if not, go ahead and follow up in this BZ, I won't close it yet :) ). That way they will be easier to track as we work on them.
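For reference, here is a minimal sketch of the manual slot cleanup described above, run against the primary. The `psql` invocation and the slot name `repmgr_slot_1` are illustrative assumptions (repmgr names slots per node id; confirm the real name with the first query before dropping anything):

```
# List replication slots on the primary; an orphaned slot left behind by
# the removed standby will show active = f and will cause WAL to be retained.
psql -U postgres -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"

# Drop the orphaned slot. 'repmgr_slot_1' is a hypothetical name here;
# use the slot_name reported by the query above.
psql -U postgres -c "SELECT pg_drop_replication_slot('repmgr_slot_1');"
```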
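And for completeness, the two commands mentioned in this comment in one place, in case anyone lands here with the same failover question (run each on the standby in question):

```
# Cleanly remove a still-reachable standby from the repmgr configuration.
# The replication slot on the primary still has to be dropped manually,
# as shown above.
repmgr standby unregister -f /etc/repmgr.conf

# Verify that the automatic failover daemon is running; if it is not,
# this node cannot participate in automatic failover.
systemctl status rh-postgresql95-repmgr
```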