1342255 – Replication stops if network connection is lost for over 60s

Summary: Replication stops if network connection is lost for over 60s

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Replication
Sub Component:
Version:	5.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	GA
Target Release:	5.7.0
Assignee:	Nick Carboni
QA Contact:	Alex Newman
Docs Contact:
URL:
Whiteboard:	distributed
Depends On:
Blocks:	1344050

Reported:	2016-06-02 18:45 UTC by Nick Carboni
Modified:	2017-01-12 04:42 UTC
CC List:	6 users
Fixed In Version:	5.7.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1344050
Environment:
Last Closed:	2017-01-11 19:52:49 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Description Nick Carboni 2016-06-02 18:45:10 UTC

Description of problem:
If the connection between the remote and global databases is lost for longer than 60 seconds, replication stops and does not resume when the connection is restored.

After 60 seconds, postgresql.log on the regional database server shows the following:
2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG:  terminating walsender process due to replication timeout
2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG:  disconnection: session time: 0:12:33.127 user=root database=vmdb_production host=10.8.99.231 port=48034
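
To check whether a server has hit this condition, the log can be scanned for the timeout message. The following is a minimal Python sketch, assuming the log line prefix starts with a "YYYY-MM-DD HH:MM:SS TZ" timestamp as in the excerpt above; the function name is illustrative and not part of CloudForms.

```python
import re

# Message emitted by PostgreSQL when wal_sender_timeout expires (taken from the log excerpt).
TIMEOUT_MESSAGE = "terminating walsender process due to replication timeout"

# Assumption: each line starts with "YYYY-MM-DD HH:MM:SS TZ", as in the excerpt.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+)")


def find_walsender_timeouts(lines):
    """Return the timestamp (or None if unparseable) of every walsender timeout line."""
    hits = []
    for line in lines:
        if TIMEOUT_MESSAGE in line:
            match = TIMESTAMP.match(line)
            hits.append(match.group(1) if match else None)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_timeouts.py /path/to/postgresql.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for stamp in find_walsender_timeouts(log):
            print(stamp)
```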

Version-Release number of selected component (if applicable):
5.6.0.9-rc2

How reproducible:
Always

Steps to Reproduce:
1. Set up pglogical replication
2. `ifdown eth0` on the global database server
3. Wait 60s
4. `ifup eth0` on the global database server

Actual results:
Replication does not resume when the connection is restored.

Expected results:
Replication resumes when the connection is restored.

Additional info:

Disabling then enabling the pglogical subscription associated with the remote region will restart replication when the connection is restored, but there is no way to do this through the UI.
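
As a rough illustration of that manual workaround, the Python sketch below builds the two SQL statements that would be run on the database holding the subscription. It assumes the pglogical functions pglogical.alter_subscription_disable and pglogical.alter_subscription_enable, each taking a subscription name and an "immediate" flag; the subscription name used here is a made-up placeholder, and the helper names are illustrative.

```python
def sql_literal(value):
    """Quote a string as a SQL literal (assumes standard_conforming_strings is on)."""
    return "'" + value.replace("'", "''") + "'"


def restart_subscription_sql(subscription_name):
    """Return the statements that disable and then re-enable a pglogical subscription."""
    name = sql_literal(subscription_name)
    return [
        # Second argument is pglogical's "immediate" flag.
        f"SELECT pglogical.alter_subscription_disable({name}, true);",
        f"SELECT pglogical.alter_subscription_enable({name}, true);",
    ]


if __name__ == "__main__":
    # "example_subscription" is hypothetical; the real names are listed in
    # the pglogical.subscription catalog table.
    for statement in restart_subscription_sql("example_subscription"):
        print(statement)
```

The printed statements can be pasted into a psql session connected to vmdb_production once the network connection is back.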

This behavior seems to be controlled by the wal_sender_timeout parameter in postgresql.conf (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT)

Setting this to 0 would disable the disconnect behavior and allow replication to continue when the connection is restored. However, if the global server were lost for good, the regional database server would accumulate WAL until the replication slot was removed manually. The setting would also affect regular streaming replication (for HA purposes).
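
For reference, a minimal sketch of the setting as it would appear in postgresql.conf on the regional (sending) database server. The PostgreSQL default is 60 seconds, which matches the timing seen in this report; the comments are explanatory and not copied from the appliance template.

```
# Terminate replication connections that are inactive for longer than this.
# The default is 60s; a value of 0 disables the timeout mechanism.
wal_sender_timeout = 0
```

The active value can be checked from psql with SHOW wal_sender_timeout;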

Comment 2 Nick Carboni 2016-06-02 19:13:19 UTC

Also, even if the WAL sender times out, it doesn't stop accumulating WAL logs. We would still need to remove the replication slot if the global server goes away for real.
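
To judge how much WAL a slot is holding back, the server's current WAL position can be compared with the slot's restart_lsn. The Python sketch below shows only that arithmetic. It assumes LSN strings in the usual "high/low" hexadecimal form, as returned on PostgreSQL 9.4 by SELECT pg_current_xlog_location() and SELECT slot_name, active, restart_lsn FROM pg_replication_slots; the function names are illustrative.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL between a slot's restart_lsn and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


if __name__ == "__main__":
    # Example values only; substitute the output of the queries named above.
    print(retained_wal_bytes("0/3000000", "0/1000000"))
```

If the global server is gone for good, an inactive slot can be removed with SELECT pg_drop_replication_slot('slot_name'), using the slot name reported by pg_replication_slots.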

Given that, there seems to be no downside to disabling the timeout, so I'll make that change for now. It will still be configurable if we run into any issues with it.

Comment 3 Nick Carboni 2016-06-02 19:22:03 UTC

https://github.com/ManageIQ/manageiq-appliance/pull/74

Comment 4 CFME Bot 2016-06-03 19:17:26 UTC

New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/cc675e11f71dfa4bbf26952094cdf1ee91c7c532

commit cc675e11f71dfa4bbf26952094cdf1ee91c7c532
Author:     Nick Carboni <ncarboni>
AuthorDate: Thu Jun 2 15:17:26 2016 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jun 3 14:53:28 2016 -0400

    Disable WAL sender timeout behavior
    
    The default behavior is to disable replication (the wal sender)
    after the destination is unreachable for 60 seconds
    
    This behavior is controlled by the wal_sender_timeout parameter
    in postgresql.conf
    (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT)
    
    Setting this to 0 would disable the disconnect behavior
    and allow replication to continue when the connection was restored
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1342255

 TEMPLATE/var/opt/rh/rh-postgresql94/lib/pgsql/data/postgresql.conf.erb | 1 +
 1 file changed, 1 insertion(+)
