Description of problem:
If you try to reintroduce a failed primary node after a certain amount of time has passed, the reintroduction fails.

Version-Release number of selected component (if applicable):
5.7.0.17

How reproducible:
100%

Steps to Reproduce:
1. Set up HA following the docs.
2. Stop the primary PostgreSQL server.
3. Wait for failover.
4. Try to reintroduce the failed node.

Actual results:
The reintroduction fails.

Expected results:
The failed node is reintroduced correctly.

Additional info:
Current docs to help with setup and reintroduction of nodes: https://doc-stage.usersys.redhat.com/documentation/en/red-hat-cloudforms/4.2/single/configuring-high-availability/
This happens because the WAL segments needed to catch the standby up after the rewind are no longer on the primary server. To mitigate this we can look into a better value for wal_keep_segments [1], but that only buys a limited amount of time. The complete solution is to set up WAL archiving [2] so that segments no longer needed on the primary are still available if a standby needs them. I'm not sure this is in the scope of our feature, as the reintroduction is still a fairly manual process anyway.

As a workaround, customers will have to recreate the standby from scratch by removing the contents of the data directory and using the console to reconfigure the failed primary as a standby node.

[1] https://www.postgresql.org/docs/9.5/static/runtime-config-replication.html#GUC-WAL-KEEP-SEGMENTS
[2] https://www.postgresql.org/docs/9.5/static/continuous-archiving.html
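For reference, a minimal sketch of the PostgreSQL 9.5 settings involved, assuming the archive lives at /var/lib/pgsql/wal_archive (a hypothetical path on storage reachable by both nodes); the exact values would depend on write volume and available disk:

    # postgresql.conf on the primary

    # Keep more completed WAL segments on the primary so a rewound standby
    # can catch up (each segment is 16 MB, so 128 segments ~= 2 GB of disk).
    wal_keep_segments = 128

    # Continuous archiving: copy each completed segment to a location that
    # survives the primary recycling its own copy.
    archive_mode = on
    archive_command = 'test ! -f /var/lib/pgsql/wal_archive/%f && cp %p /var/lib/pgsql/wal_archive/%f'

    # recovery.conf on the standby being reintroduced: fetch any segment the
    # primary no longer has from the archive.
    restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'

With archiving in place, a rewound standby could replay the missing segments from the archive instead of needing its data directory rebuilt from scratch.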
This should have been fixed by https://github.com/ManageIQ/manageiq-gems-pending/pull/126
Verified in 5.8.2.0