1393049 – Backup/restore fails to run on an appliances in pglogical replication

Bug 1393049 - Backup/restore fails to run on an appliances in pglogical replication

Summary: Backup/restore fails to run on an appliances in pglogical replication

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.8.0
Assignee:	Nick Carboni
QA Contact:	luke couzens
Docs Contact:
URL:
Whiteboard:	black:replication
Depends On:
Blocks:	1399673
TreeView+	depends on / blocked

Reported:	2016-11-08 17:57 UTC by luke couzens
Modified:	2021-03-17 16:35 UTC (History)
CC List:	8 users (show)
Fixed In Version:	5.8.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1399673 (view as bug list)
Environment:
Last Closed:	2017-06-12 17:25:33 UTC
Category:	---
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description luke couzens 2016-11-08 17:57:12 UTC

Description of problem:Restoring back fails on appliances setup in replication using pglogical due to replication slots always accessing the db while trying to restore.


Version-Release number of selected component (if applicable):5.7.0.9 and 5.6.2.2


How reproducible:100%


Steps to Reproduce:
1.provision 2 appliances
2.configure internal database region99
3.configure second, fetching v2_key from first and setting it as region1
4.wait for evm
5.connect to webui or rr1
6.navigate to settings-configuration-settings-region-replication
7.select type 'Remote' hit save
8.connect to webui or rr99
9.navigate to settings-configuration-settings-region-replication
10.select type 'Global'
11.add subscription filling in details of rr1
12.click verify/save
13.make a backup of both appliances
14.stop evmserverd
15.restore backup to appliances


Actual results:Restore fails using appliance_console due to there being connections to the database 

Expected results:restore completes successfully


Additional info:

dropdb: database removal failed: ERROR:  database "vmdb_production" is used by a logical replication slot
DETAIL:  There is 1 slot, 1 of them active.

You can run pg_restore without dropping the db but you get 1404 errors
also evm will not start back up afterwards.

Comment 3 Nick Carboni 2016-11-11 21:59:26 UTC

Luke,

Can you list out exactly how you went about creating the backup and doing the restore?

For example did you use pg_dump manually or did you use the rake task? Same with the restore, did you use the console?

Also, it looks like some of this stuff might log to the appliance_console.log can you check for any additional output there?

Thanks

Comment 5 Nick Carboni 2016-11-14 21:42:44 UTC

So you can finish up the restore if you drop the subscription on the global region. That is the extra connection that is preventing the restore from moving forward.

We can make that a check we do during the restore process (drop the subscription or warn about it).

Another issue we will run into is that replication won't survive the backup/restore process. This is because pg_dump doesn't handle replication slots which is what pglogical relies on to work.

We would need to change the format of the backup file to include cfme specific metadata about replication slots.

That would be similar to the changes made to the upgrade script for this issue https://bugzilla.redhat.com/show_bug.cgi?id=1391693

Comment 6 Nick Carboni 2016-11-18 22:19:35 UTC

This PR will allow the restore to go through as long as we are either restoring to a global database or there are no subscriptions to the remote database.

We get around the connection problem by dropping subscriptions if they exist in global databases before executing the restore.

To restore a remote database, the subscription will have to be dropped manually before attempting the restore.

https://github.com/ManageIQ/manageiq-gems-pending/pull/23

Unfortunately, this does not solve the issue of replication not working after the restore, but hopefully this we be enough for this blocker bug as fixing replication for the restore will be a much larger effort.

Comment 7 CFME Bot 2016-11-28 15:54:19 UTC

New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/aca175864e5962fa804552840873d7918a2df2d8

commit aca175864e5962fa804552840873d7918a2df2d8
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Nov 18 17:10:30 2016 -0500
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Nov 18 17:10:30 2016 -0500

    Don't test for connections before the restore
    
    The first part of the restore is to drop the database. This will fail
    if there are connections to the database.
    
    Failing before we call restore will also prevent us from fixing
    the situation by removing any pglogical subscriptions that may
    be present before dropping the database.
    
    Instead we can print an error after we fail to inform the user
    that the failure was because there were connections to the database.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1393049

 lib/gems/pending/appliance_console.rb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comment 8 CFME Bot 2016-11-28 15:54:24 UTC

New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/d081653ba85a0752f102a7a9c18b5e85be803ea4

commit d081653ba85a0752f102a7a9c18b5e85be803ea4
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Nov 18 17:13:00 2016 -0500
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Nov 18 17:13:00 2016 -0500

    Drop pglogical subscriptions before the restore
    
    pglogical subscriptions cause a postgresql worker process to hold
    open connections to the database. If we attempt to drop the database
    while we still have subscriptions, we will always fail.
    
    This commit removes any existing subscriptions before attempting
    a restore and ignores the failures in the case that pglogical
    is not being used. This allows the database restore to complete
    cleanly.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1393049

 lib/gems/pending/util/postgres_admin.rb | 14 +++++++++++---
 spec/util/postgres_admin_spec.rb        |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

Comment 10 luke couzens 2017-03-07 09:13:06 UTC

Verified in 5.8.0.3

Comment 11 CFME Bot 2021-03-17 16:35:53 UTC

New commit detected on ManageIQ/manageiq-appliance_console/master:

https://github.com/ManageIQ/manageiq-appliance_console/commit/117862145b48c1f0ad9b3d52bcce129186654329
commit 117862145b48c1f0ad9b3d52bcce129186654329
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Nov 18 22:13:00 2016 +0000
Commit:     Nick LaMuro <nicklamuro>
CommitDate: Tue Mar 16 19:25:16 2021 +0000

    Drop pglogical subscriptions before the restore

    pglogical subscriptions cause a postgresql worker process to hold
    open connections to the database. If we attempt to drop the database
    while we still have subscriptions, we will always fail.

    This commit removes any existing subscriptions before attempting
    a restore and ignores the failures in the case that pglogical
    is not being used. This allows the database restore to complete
    cleanly.

    https://bugzilla.redhat.com/show_bug.cgi?id=1393049


    (transferred from ManageIQ/manageiq-gems-pending@d081653ba85a0752f102a7a9c18b5e85be803ea4)
 lib/manageiq/appliance_console/postgres_admin.rb | 14 +-
 spec/postgres_admin_spec.rb | 7 +
 2 files changed, 18 insertions(+), 3 deletions(-)

Note You need to log in before you can comment on or make changes to this bug.