1426769 – After reintroducing a failed primary node, there are old replication slots left on the "new" node

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.9.0
Assignee:	Nick Carboni
QA Contact:	luke couzens
Docs Contact:
URL:
Whiteboard:	HA
Depends On:
Blocks:	1445380 1445383

Reported:	2017-02-24 19:41 UTC by Nick Carboni
Modified:	2021-03-17 16:35 UTC
CC List:	7 users
Fixed In Version:	5.9.0.1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1445380 1445383
Environment:
Last Closed:	2018-03-06 15:22:01 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Description Nick Carboni 2017-02-24 19:41:31 UTC

When a primary node in an HA cluster fails, it retains the replication slot(s) that were being used by any standby servers.

When this node is reintroduced into the cluster and starts generating WAL, that slot causes the new WAL to be retained and will eventually cause disk space issues.

That slot should be dropped when reintroducing the node into the cluster.

---

The challenge here will be deciding whether to drop all the replication slots or to pick the particular ones that should be removed.
Generally, I feel this should be the responsibility of repmgr when running `repmgr standby follow`, so maybe we can open an RFE on that project as well (if this is not already included in a newer version).
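
For illustration only, here is a minimal Ruby sketch of the "pick particular slots" option. It assumes the rows have already been fetched from the `pg_replication_slots` view with `active` cast to a boolean, and it treats every inactive slot as stale, which is an assumption rather than a decision made in this bug. The helper name `drop_statements_for_stale_slots` and the slot names are hypothetical; `pg_drop_replication_slot` is the standard PostgreSQL function.

```ruby
# Hypothetical helper: not part of repmgr or ManageIQ.
# Each row mirrors a record from the pg_replication_slots view,
# e.g. { "slot_name" => "repmgr_slot_2", "active" => false }.
# Assumption: "active" is already a Ruby boolean, not the string "t"/"f".
def drop_statements_for_stale_slots(slot_rows)
  slot_rows
    .reject { |row| row["active"] } # keep only slots with no connected consumer
    .map do |row|
      # Defensive quoting; valid slot names cannot contain quotes anyway.
      name = row["slot_name"].gsub("'", "''")
      "SELECT pg_drop_replication_slot('#{name}');"
    end
end

# Illustrative slot names only.
rows = [
  { "slot_name" => "repmgr_slot_2", "active" => false },
  { "slot_name" => "repmgr_slot_3", "active" => true }
]
puts drop_statements_for_stale_slots(rows)
```

The sketch only generates statements; running them against the reintroduced node would be a separate step.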

Comment 2 CFME Bot 2017-04-19 22:07:56 UTC

https://github.com/ManageIQ/manageiq-gems-pending/pull/124

Comment 3 CFME Bot 2017-04-22 02:28:13 UTC

https://github.com/ManageIQ/manageiq-gems-pending/pull/126

Comment 4 CFME Bot 2017-04-24 20:34:18 UTC

New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/3c90a73de7c8ec34364050de8ef677f72ac424d7

commit 3c90a73de7c8ec34364050de8ef677f72ac424d7
Author:     Nick Carboni <ncarboni>
AuthorDate: Tue Apr 18 16:22:16 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Apr 24 15:52:52 2017 -0400

    Alter PostgresAdmin.prep_data_directory to remove all contents
    
    This will allow us to use it for reinitializing a database server
    as a standby when it was previously a primary.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1426769

 lib/gems/pending/util/postgres_admin.rb | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comment 5 CFME Bot 2017-04-24 20:34:28 UTC

New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/63a179ea2b419007df07a0385989f8f20978ee8f

commit 63a179ea2b419007df07a0385989f8f20978ee8f
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Apr 19 17:43:17 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Apr 24 15:52:53 2017 -0400

    Offer to clear the data directory for new standby servers
    
    This will allow seamless reintegration of failed primary
    servers after a failover.
    
    When this happens the user will be given the option to clear
    the existing database and re-clone the new primary into this server
    and then continue to set up a standby as before.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1426718
    https://bugzilla.redhat.com/show_bug.cgi?id=1426769
    https://bugzilla.redhat.com/show_bug.cgi?id=1442911

 .../database_replication_standby.rb                |  20 +--
 .../database_replication_standby_spec.rb           | 143 ++++++++++++++-------
 2 files changed, 112 insertions(+), 51 deletions(-)

Comment 6 Nick Carboni 2017-04-25 12:32:19 UTC

This should be fixed: a failed master (which would have replication slots on it) is now completely wiped and re-initialized when it is reintroduced into the cluster as a standby using the console "Configure Standby" option.
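
As a rough sketch of why wiping the data directory also removes the old slots: slot state is stored under `pg_replslot` inside the PostgreSQL data directory, so clearing every entry there discards it. The helper below is hypothetical and is not the actual `PostgresAdmin.prep_data_directory` code; it only demonstrates removing all contents, hidden files included, while keeping the directory itself.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper, not the actual PostgresAdmin.prep_data_directory.
def clear_directory_contents(dir)
  # Dir.children omits "." and ".." but includes hidden entries.
  Dir.children(dir).each do |entry|
    FileUtils.rm_rf(File.join(dir, entry))
  end
end

# Demonstration against a throwaway directory, never a real data directory.
Dir.mktmpdir do |data_dir|
  FileUtils.mkdir_p(File.join(data_dir, "pg_replslot", "old_slot"))
  File.write(File.join(data_dir, "postgresql.conf"), "")
  clear_directory_contents(data_dir)
  puts Dir.children(data_dir).empty?
end
```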

Comment 9 luke couzens 2017-10-12 12:44:19 UTC

Verified in 5.9.0.2

Comment 10 CFME Bot 2021-03-17 16:35:58 UTC

New commit detected on ManageIQ/manageiq-appliance_console/master:

https://github.com/ManageIQ/manageiq-appliance_console/commit/012aaefe755d8a0c7264381e6196a37166f4558d
commit 012aaefe755d8a0c7264381e6196a37166f4558d
Author:     Nick Carboni <ncarboni>
AuthorDate: Tue Apr 18 20:22:16 2017 +0000
Commit:     Nick LaMuro <nicklamuro>
CommitDate: Tue Mar 16 19:25:16 2021 +0000

    Alter PostgresAdmin.prep_data_directory to remove all contents

    This will allow us to use it for reinitializing a database server
    as a standby when it was previously a primary.

    https://bugzilla.redhat.com/show_bug.cgi?id=1426769


    (transferred from ManageIQ/manageiq-gems-pending@3c90a73de7c8ec34364050de8ef677f72ac424d7)
 lib/manageiq/appliance_console/postgres_admin.rb | 3 +-
 1 file changed, 1 insertion(+), 2 deletions(-)
