Bug 1426718 - [RFE] Make the process of reintroducing a failed HA node more user-friendly
Summary: [RFE] Make the process of reintroducing a failed HA node more user-friendly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: John Hardy
QA Contact: luke couzens
URL:
Whiteboard: HA:black
Depends On:
Blocks: 1445379 1450511
 
Reported: 2017-02-24 16:34 UTC by Nick Carboni
Modified: 2018-03-06 15:16 UTC
CC List: 6 users

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1445379 1450511
Environment:
Last Closed: 2018-03-06 15:16:30 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:


Attachments

Description Nick Carboni 2017-02-24 16:34:02 UTC
Description of problem:
Right now, reintroducing a failed primary database node in an HA architecture is a painfully manual process. It is prone to issues and doesn't always work properly if WAL archiving isn't configured (https://bugzilla.redhat.com/show_bug.cgi?id=1406815).

Version-Release number of selected component (if applicable):
5.7.0.17

It would be good if we offered a separate console option (or enhanced the existing standby setup option) that recreated the database on an appliance from a new base backup taken from the primary.

This would amount to removing the contents of the data directory and running through the same steps to configure a standby node.

This will need a prominent warning stating that it is a destructive operation and that all data currently stored in the local database will be lost.

After seeing the issues with pg_rewind, I would rather see this as the "right" way to reintroduce a node.
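
As a rough sketch of the steps described above (assuming a CFME 5.7-era appliance layout with the rh-postgresql95 data directory and service, and using pg_basebackup in place of the console's own standby-setup logic; the primary host, replication user, and paths are illustrative assumptions, not the console's actual values):

#!/usr/bin/env ruby
# Minimal sketch only: rebuild a failed primary as a standby by discarding its
# data directory and taking a fresh base backup from the new primary.
require "fileutils"

PRIMARY_HOST = "10.0.0.1"  # node that became primary after failover (assumed)
REPL_USER    = "root"      # replication role; assumed, check your configuration
DATA_DIR     = "/var/opt/rh/rh-postgresql95/lib/pgsql/data"

# 1. Stop the local server so the data directory can be replaced.
system("systemctl", "stop", "rh-postgresql95-postgresql") || abort("could not stop postgresql")

# 2. DESTRUCTIVE: all data currently stored in the local database is lost here.
FileUtils.rm_rf(Dir.glob(File.join(DATA_DIR, "*")))

# 3. Take a fresh base backup from the new primary. -X stream includes the WAL
#    needed for consistency; -R writes recovery settings so the node comes back
#    up as a streaming standby.
system("pg_basebackup", "-h", PRIMARY_HOST, "-U", REPL_USER,
       "-D", DATA_DIR, "-X", "stream", "-R") || abort("base backup failed")

# 4. Start the server; it connects to the primary and begins streaming.
system("systemctl", "start", "rh-postgresql95-postgresql")

Whatever mechanism performs the clone, the key property is that the old data directory is discarded and the node is rebuilt entirely from the new primary's state.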

Comment 5 CFME Bot 2017-04-24 20:34:24 UTC
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/63a179ea2b419007df07a0385989f8f20978ee8f

commit 63a179ea2b419007df07a0385989f8f20978ee8f
Author:     Nick Carboni <ncarboni@redhat.com>
AuthorDate: Wed Apr 19 17:43:17 2017 -0400
Commit:     Nick Carboni <ncarboni@redhat.com>
CommitDate: Mon Apr 24 15:52:53 2017 -0400

    Offer to clear the data directory for new standby servers
    
    This will allow seamless reintegration of failed primary
    servers after a failover.
    
    When this happens the user will be given the option to clear
    the existing database and re-clone the new primary into this server
    and then continue to set up a standby as before.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1426718
    https://bugzilla.redhat.com/show_bug.cgi?id=1426769
    https://bugzilla.redhat.com/show_bug.cgi?id=1442911

 .../database_replication_standby.rb                |  20 +--
 .../database_replication_standby_spec.rb           | 143 ++++++++++++++-------
 2 files changed, 112 insertions(+), 51 deletions(-)

Comment 6 Nick Carboni 2017-04-25 12:30:45 UTC
Changed the console option for standby setup to also allow re-initializing failed primary servers.

This is much simpler and less error-prone than using pg_rewind and other manual CLI commands.
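
A rough illustration of that console flow, with hypothetical class and method names (the actual logic lives in manageiq-gems-pending's database_replication_standby.rb):

# Sketch only: if the standby-setup option finds an existing (failed-primary)
# data directory, warn, offer to clear it, then continue standby setup.
require "fileutils"

class StandbyReinitializeSketch
  DATA_DIR = "/var/opt/rh/rh-postgresql95/lib/pgsql/data" # assumed appliance layout

  def run
    if data_dir_populated?
      puts "WARNING: this is a destructive operation."
      puts "All data currently stored in the local database will be lost."
      print "Clear the data directory and re-clone from the new primary? (y/N) "
      return unless $stdin.gets.to_s.strip.downcase == "y"
      clear_data_dir
    end
    configure_standby
  end

  private

  def data_dir_populated?
    !Dir.glob(File.join(DATA_DIR, "*")).empty?
  end

  def clear_data_dir
    FileUtils.rm_rf(Dir.glob(File.join(DATA_DIR, "*")))
  end

  def configure_standby
    # The pre-existing standby setup steps (cloning from the new primary,
    # writing recovery settings, registering the standby) would run from here.
    puts "Continuing with the usual standby configuration..."
  end
end

StandbyReinitializeSketch.new.run if $PROGRAM_NAME == __FILE__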

Comment 9 luke couzens 2017-10-12 12:03:58 UTC
Verified in 5.9.0.2

