Description of problem:
- Configured an HA CFME 4.2 database cluster per the documentation.
- Rebuilt the standby database server in the cluster from scratch and attempted to re-add it using the appliance_console ("Configure Database Replication") with the same unique node ID. The appliance_console threw the following error:

  ERROR:  duplicate key value violates unique constraint "repl_nodes_pkey"
  DETAIL:  Key (id)=(1) already exists

- Was able to re-add the database node with a new unique ID (3), so the database configuration is now as follows:
  * Node ID 1: does not exist
  * Node ID 2: primary
  * Node ID 3: standby

Questions:
* Are there any issues associated with re-adding the standby node with a new key?
* What happens to the node identified by key 1?
* Can I clean up the existing database cluster node IDs somehow?

Environment info:
* 3 x CFME appliances configured for the database failover monitor
* 2 x CFME dedicated-database nodes (1 primary, 1 standby)
* All appliances are CloudForms 4.2 (CFME 5.7.0.17)

Version-Release number of selected component (if applicable):
CFME 5.7.0.17

How reproducible:

Steps to Reproduce:
1. Add a standby DB server with node ID 1.
2. Delete the standby DB server.
3. Create a new standby DB server from the CFME VM template.
4. Try to add the new standby DB server with node ID 1.

Actual results:
ERROR: duplicate key value violates unique constraint "repl_nodes_pkey"

Expected results:
The new standby is registered with node ID 1 without error.

Additional info:
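The resulting node layout can be inspected directly in the repmgr schema on the primary. This is a sketch: the schema name `repmgr_miq` follows repmgr's default of `repmgr_<cluster_name>` and may differ on your cluster.

```sql
-- Inspect the registered nodes. With the configuration above this is
-- expected to show id 2 (the primary) and id 3 (the standby), with no
-- row for id 1. Schema name is an assumption; check repmgr.conf.
SELECT id, type, name, active FROM repmgr_miq.repl_nodes ORDER BY id;
```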
Taeho,

Just to recap:
- Created a two-node cluster (node 2 as primary and node 1 as standby)
- Removed node 1 (node 2 is still primary)
- Deployed a *new* appliance and attempted to add it as node 1

At this point everything behaved as I would expect: a failed node stays in the repl_nodes table for record-keeping purposes. A failed node in this table does not affect quorum calculations and is ignored for failover purposes.

What is more important is that the replication slot is removed. If it is not removed, WAL logs will accumulate on the primary. Your steps for removing the node and the replication slot were correct, assuming the standby node is truly *gone*. If the standby node is still accessible, you can run `repmgr standby unregister -f /etc/repmgr.conf` on the standby to cleanly remove it from the repmgr configuration, but you will still have to remove the replication slot manually. It is also important to note that "standby unregister" will not stop the standby node, so replication will continue (i.e. the slot will stay active).

When you added a new node with the same ID manually, failover didn't work because repmgrd (the service that coordinates automatic failover) would not have been running on the new node. You can check the status of this service using the command `systemctl status rh-postgresql95-repmgr`.

It sounds like we have a handful of RFEs out of this, but nothing I would consider "broken":

- Create an option for removing a standby cleanly
  - This option will take care of removing all the data from the repmgr tables as well as handling the replication slot.
  - This option will be usable only from the primary database, and the user will provide the connection information for the desired standby. (Further improvements can be made to parse the conninfo for the primary or other standby nodes from the repl_nodes table.)
- Create a separate option for controlling the repmgrd service
  - If the service is not started when the standby is created, none of the configuration files necessary to start it are present, so it is not trivial to get it up and running.

Given all of this, I would like to create two separate RFEs for these issues and close this issue if your question is answered, Taeho (if not, go ahead and follow up in this BZ, I won't close it yet :) ). That way they will be easier to track as we work on them.
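For reference, the manual cleanup described above amounts to something like the following, run with psql against the primary. The schema name `repmgr_miq` and slot name `repmgr_slot_1` are assumptions based on repmgr's defaults (schema `repmgr_<cluster_name>`, slot named after the node ID); verify the actual names on your cluster before running anything.

```sql
-- Hypothetical manual cleanup of a dead standby (node 1), run on the primary.
-- Schema and slot names are assumptions; check repmgr.conf and the tables first.

-- 1. Confirm the stale node record and the orphaned replication slot.
SELECT id, type, name, active FROM repmgr_miq.repl_nodes;
SELECT slot_name, active FROM pg_replication_slots;

-- 2. Remove the stale node record so its ID can be reused.
DELETE FROM repmgr_miq.repl_nodes WHERE id = 1;

-- 3. Drop the orphaned slot so WAL stops accumulating on the primary.
--    Only do this if the slot is no longer active.
SELECT pg_drop_replication_slot('repmgr_slot_1');
```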