Bug 1419420
Summary: | How to delete standby db server in CFME 4.2 HA env | |
---|---|---|---
Product: | Red Hat CloudForms Management Engine | Reporter: | tachoi
Component: | Replication | Assignee: | Nick Carboni <ncarboni>
Status: | CLOSED NOTABUG | QA Contact: | Dave Johnson <dajohnso>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 5.7.0 | CC: | dshevrin, gtanzill, jhardy, ncarboni, obarenbo, tachoi
Target Milestone: | GA | |
Target Release: | cfme-future | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-02-14 02:01:01 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
tachoi
2017-02-06 06:13:26 UTC
Taeho, just to recap:

- Created a two-node cluster (node 2 as primary and node 1 as standby)
- Removed node 1 (node 2 is still primary)
- Deployed a *new* appliance and attempted to add it as node 1

At this point everything behaved as I would expect, since a failed node stays in the repl_nodes table for record-keeping purposes. A failed node in this table does not affect quorum calculations and is ignored for failover purposes.

What matters more is that the replication slot is removed. If it is not removed, WAL will accumulate on the primary. Your steps for removing the node and the replication slot were correct, assuming the standby node is truly *gone*. If the standby node is still accessible, you can run `repmgr standby unregister -f /etc/repmgr.conf` on the standby to cleanly remove it from the repmgr configuration, but you will still have to remove the replication slot manually (see the sketch at the end of this comment). It is also important to note that "standby unregister" will not stop the standby node, so replication will continue (i.e. the slot will stay active).

When you added a new node with the same id manually, failover didn't work because repmgrd (the service that coordinates automatic failover) would not have been running on the new node. You can check the status of this service with `systemctl status rh-postgresql95-repmgr`.

It sounds like we have a handful of RFEs out of this, but nothing I would consider "broken":

- Create an option for removing a standby cleanly
  - This option will take care of removing all the data from the repmgr tables as well as handling the replication slot.
  - This option will be usable only from the primary database, and the user will provide the connection information for the desired standby. (Further improvements can be made to parse the conninfo for the primary or other standby nodes from the repl_nodes table.)
- Create a separate option for controlling the repmgrd service
  - If the service is not started when creating the standby, none of the configuration files necessary to start the service are present, so it is not trivial to get it up and running.

Given all of this, I would like to create two separate RFEs for these issues and close this one if your question is answered, Taeho (if not, go ahead and follow up in this BZ, I won't close it yet :) ). That way they will be easier to track as we work on them.
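For reference, here is a minimal sketch of the manual slot cleanup described above, run against the primary. The `psql` invocation and the slot name `repmgr_slot_1` are illustrative assumptions (repmgr names slots per node id; confirm the real name with the first query before dropping anything):

```
# List replication slots on the primary; an orphaned slot left behind by
# the removed standby will show active = f and will cause WAL to be retained.
psql -U postgres -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"

# Drop the orphaned slot. 'repmgr_slot_1' is a hypothetical name here;
# use the slot_name reported by the query above.
psql -U postgres -c "SELECT pg_drop_replication_slot('repmgr_slot_1');"
```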
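And for completeness, the two commands mentioned in this comment in one place, in case anyone lands here with the same failover question (run each on the standby in question):

```
# Cleanly remove a still-reachable standby from the repmgr configuration.
# The replication slot on the primary still has to be dropped manually,
# as shown above.
repmgr standby unregister -f /etc/repmgr.conf

# Verify that the automatic failover daemon is running; if it is not,
# this node cannot participate in automatic failover.
systemctl status rh-postgresql95-repmgr
```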