Bug 1335945 - Replication subscription sync blocks due to pglogical locking bug
Summary: Replication subscription sync blocks due to pglogical locking bug
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Replication
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.0
Assignee: Nick Carboni
QA Contact: Alex Newman
URL:
Whiteboard: replication:database
Depends On:
Blocks:
Reported: 2016-05-13 15:13 UTC by Nick Carboni
Modified: 2016-08-26 14:21 UTC

Fixed In Version: 5.6.0.8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-29 16:02:23 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1348 0 normal SHIPPED_LIVE CFME 5.6.0 bug fixes and enhancement update 2016-06-29 18:50:04 UTC

Description Nick Carboni 2016-05-13 15:13:12 UTC
Description of problem:
After dropping a pglogical node, the server process acquires many shared access locks which prevent subsequent attempts to create replication slots (subscriptions).

Version-Release number of selected component (if applicable):
master-201605122000

How reproducible:
Always

Steps to Reproduce:
1. Deploy two appliances (region 0 and region 99)
2. Set the region 0 appliance to replication type "remote", then "none", then "remote" (create the pglogical node, remove it, recreate it)
3. Create a subscription to the region 0 appliance on the region 99 appliance
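
At the database level, step 2 roughly corresponds to creating, dropping, and recreating a pglogical node on region 0, and step 3 to creating a subscription on region 99. The following is a minimal sketch of that sequence, assuming the appliance ends up issuing calls equivalent to pglogical's SQL functions. The node name, subscription name, and DSN are hypothetical placeholders, and the script only builds the statements; it does not connect to a database.

```python
# Sketch only: builds the SQL sequence for the reproduction steps.
# All names and the DSN are hypothetical placeholders, not appliance values.
# The subscriber (region 99) also needs its own pglogical node before it can
# subscribe; that setup is omitted here for brevity.

def repro_statements(node_name, provider_dsn, subscription_name):
    """Return (target, sql) pairs: create node, drop it, recreate it, subscribe."""
    create_node = (
        "SELECT pglogical.create_node(node_name := '%s', dsn := '%s');"
        % (node_name, provider_dsn)
    )
    drop_node = "SELECT pglogical.drop_node(node_name := '%s');" % node_name
    create_subscription = (
        "SELECT pglogical.create_subscription("
        "subscription_name := '%s', provider_dsn := '%s');"
        % (subscription_name, provider_dsn)
    )
    return [
        ("region 0", create_node),           # replication type "remote"
        ("region 0", drop_node),             # replication type "none"
        ("region 0", create_node),           # replication type "remote" again
        ("region 99", create_subscription),  # step 3
    ]


if __name__ == "__main__":
    plan = repro_statements(
        "region_0",
        "host=region0.example.com dbname=vmdb_production",
        "region_0_subscription",
    )
    for target, sql in plan:
        print("[%s] %s" % (target, sql))
```

With the bug present, the last statement is the one expected to hang.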

Actual results:
The initial sync never occurs.

Expected results:
Table data is synced from region 0 to region 99

Additional info:
The node drop code on the region 0 database acquires shared locks for the replication slots and never releases them.

The subscription create process then blocks trying to take an exclusive lock to create the replication slot for the new subscription.
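
The interaction can be illustrated with a toy shared/exclusive lock. This is not pglogical or PostgreSQL code; the class and function names are hypothetical, and the model only shows why a shared lock that is never released keeps an exclusive request from ever being granted.

```python
# Toy model of the blocking behaviour described above. Illustration only:
# ToyLock, drop_node and create_subscription are hypothetical names and do
# not mirror the real pglogical implementation.

class ToyLock:
    """Allows many shared holders or exactly one exclusive holder."""

    def __init__(self):
        self.shared_holders = 0
        self.exclusive_held = False

    def try_acquire_shared(self):
        if self.exclusive_held:
            return False
        self.shared_holders += 1
        return True

    def release_shared(self):
        if self.shared_holders == 0:
            raise RuntimeError("no shared lock to release")
        self.shared_holders -= 1

    def try_acquire_exclusive(self):
        # An exclusive request cannot be granted while any holder remains.
        if self.exclusive_held or self.shared_holders > 0:
            return False
        self.exclusive_held = True
        return True


def drop_node(lock, release_when_done):
    """Take a shared lock to inspect the slots; the buggy path never releases it."""
    lock.try_acquire_shared()
    if release_when_done:
        lock.release_shared()


def create_subscription(lock):
    """Slot creation needs the lock exclusively; returns True if it could proceed."""
    return lock.try_acquire_exclusive()


if __name__ == "__main__":
    buggy = ToyLock()
    drop_node(buggy, release_when_done=False)
    print("buggy drop, subscription proceeds:", create_subscription(buggy))

    fixed = ToyLock()
    drop_node(fixed, release_when_done=True)
    print("fixed drop, subscription proceeds:", create_subscription(fixed))
```

In the real server the exclusive request waits rather than returning immediately, which is why the initial sync appears to hang instead of failing.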

I have not found a good workaround so far. At a minimum, we would need to restart the postgres service on the regional database to release the locks, but even that would leave the global process in a bad state.

This is a bug in pglogical, for which I have submitted a patch here: https://github.com/2ndQuadrant/postgres/pull/3

Unfortunately PRs don't seem to be particularly welcome.

Because we are already running a "custom" build of pglogical to be compatible with SCL postgresql (i.e. we are not dependent on their source), we could apply the patch and just rebuild. I've done this locally on upstream appliances.

The issue is that we don't have a good way to track these patches or review changes.

Comment 2 Nick Carboni 2016-05-13 19:34:53 UTC
Moved the PR to the pglogical repo rather than the postgres fork.

https://github.com/2ndQuadrant/pglogical/pull/3

Comment 4 Nick Carboni 2016-05-19 13:45:37 UTC
The changes were made in pglogical (https://github.com/2ndQuadrant/pglogical/commit/85052cb6e76f8a5caf2c9189729ecbc99485ef00).

So we will either update to a pglogical release that contains those changes or, if a new release does not happen in time, rebuild our rpm with just that patch included.

Comment 5 Nick Carboni 2016-05-19 19:53:30 UTC
The most recent release of pglogical (1.1.1) contains some changes that would require more extensive refactoring, which is out of scope for fixing this issue.

Because of this, we are going to include a patch in our pglogical build to solve this particular issue.

This is the commit that adds the patch and updates the spec file to include the fix:

http://pkgs.devel.redhat.com/cgit/rpms/postgresql-pglogical/commit/?h=cfme-rh-postgresql94-5.6-rhel-7

Comment 6 Nick Carboni 2016-05-19 21:05:13 UTC
Also built a new package for the upstream appliances here: https://copr.fedorainfracloud.org/coprs/ncarboni/pglogical-SCL/build/288682/

Comment 9 errata-xmlrpc 2016-06-29 16:02:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

