Bug 1406815 - HA, Reintroducing the primary Failed Node fails
Summary: HA, Reintroducing the primary Failed Node fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: GA
Target Release: cfme-future
Assignee: Nick Carboni
QA Contact: luke couzens
URL:
Whiteboard: HA
Depends On:
Blocks:
 
Reported: 2016-12-21 14:38 UTC by luke couzens
Modified: 2019-01-24 14:31 UTC
CC List: 6 users

Fixed In Version: 5.8.0.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-24 14:31:22 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:



Description luke couzens 2016-12-21 14:38:19 UTC
Description of problem: If you try to reintroduce a failed primary node after a certain amount of time has passed, the reintroduction fails.


Version-Release number of selected component (if applicable): 5.7.0.17


How reproducible: 100%


Steps to Reproduce:
1. Set up HA following the docs
2. Stop PostgreSQL on the primary
3. Wait for failover
4. Try to reintroduce the failed node (see the shell sketch below)
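
A minimal shell sketch of steps 2-4, assuming the 5.7 appliance runs PostgreSQL under the rh-postgresql95-postgresql service and that reintroduction goes through appliance_console (both assumptions, not confirmed by this report):

  # On the primary database appliance: stop PostgreSQL to trigger a failover
  systemctl stop rh-postgresql95-postgresql

  # On the standby: confirm it has been promoted (returns 'f' once it is primary)
  su - postgres -c "psql -tAc 'SELECT pg_is_in_recovery();'"

  # Back on the failed node: attempt to reintroduce it as a standby
  appliance_console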

Actual results: Reintroduction of the failed node fails.


Expected results: The failed node is reintroduced correctly.


Additional info:

Current docs to help with setup and reintroduction of nodes:
https://doc-stage.usersys.redhat.com/documentation/en/red-hat-cloudforms/4.2/single/configuring-high-availability/

Comment 2 Nick Carboni 2016-12-21 14:42:53 UTC
This happens because the WAL segments needed to catch the standby up after the rewind are no longer on the primary server.

To fix this, we can look into a better value for wal_keep_segments[1], but that will only buy us so much time.
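
For illustration, a postgresql.conf sketch of that first option; the value below is an arbitrary example, not a recommendation from this bug:

  # postgresql.conf on the primary (illustrative value)
  # Each WAL segment is 16 MB, so 256 segments retain roughly 4 GB of WAL
  # for a rewound standby to catch up from.
  wal_keep_segments = 256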

The complete solution to this is to set up WAL archiving[2] so that the segments that are no longer needed on the primary are still available if a standby needs them.
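
A sketch of what that archiving setup could look like, assuming an archive directory both nodes can reach (the path is hypothetical):

  # primary: postgresql.conf
  archive_mode = on
  archive_command = 'test ! -f /var/lib/pgsql/wal_archive/%f && cp %p /var/lib/pgsql/wal_archive/%f'

  # standby: recovery.conf
  restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'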

I'm not sure this is in the scope of our feature as the reintroduction is still a fairly manual process anyway.

As a workaround, customers will have to recreate the standby from scratch by removing the contents of the data directory and using the console to reconfigure the failed primary as a standby node.
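
Sketched as shell steps on the failed primary, that workaround would look roughly like this (the service name and data directory path are assumptions for a 5.7 appliance; verify them before deleting anything):

  # Stop PostgreSQL on the failed primary
  systemctl stop rh-postgresql95-postgresql

  # Remove the old cluster data so the node can be rebuilt as a standby
  rm -rf /var/opt/rh/rh-postgresql95/lib/pgsql/data/*

  # Use the appliance console to configure this node as a standby database
  appliance_console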

[1] https://www.postgresql.org/docs/9.5/static/runtime-config-replication.html#GUC-WAL-KEEP-SEGMENTS
[2] https://www.postgresql.org/docs/9.5/static/continuous-archiving.html

Comment 3 Nick Carboni 2017-09-15 18:22:16 UTC
This should have been fixed by https://github.com/ManageIQ/manageiq-gems-pending/pull/126

Comment 4 luke couzens 2017-09-18 09:26:58 UTC
Verified in 5.8.2.0

