Bug 1342255

Summary:	Replication stops if network connection is lost for over 60s
Product:	Red Hat CloudForms Management Engine	Reporter:	Nick Carboni <ncarboni>
Component:	Replication	Assignee:	Nick Carboni <ncarboni>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Alex Newman <anewman>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	5.6.0	CC:	cpelland, greartes, jdeubel, jhardy, obarenbo, simaishi
Target Milestone:	GA	Keywords:	TestOnly, ZStream
Target Release:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	distributed
Fixed In Version:	5.7.0.0	Doc Type:	If docs needed, set a value
Clones:	1344050		Environment:
Last Closed:	2017-01-11 19:52:49 UTC	Type:	Bug
Bug Blocks:	1344050

Description Nick Carboni 2016-06-02 18:45:10 UTC

Description of problem:
If the connection between the remote and global databases is lost for longer than 60 seconds, replication stops and does not resume when the connection is restored.

After 60 seconds, postgresql.log on the regional (remote) database server shows the following:
2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG:  terminating walsender process due to replication timeout
2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG:  disconnection: session time: 0:12:33.127 user=root database=vmdb_production host=10.8.99.231 port=48034
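
One way to check whether a stalled region matches this bug is to count the walsender timeout messages in the regional server's log. This is a minimal sketch assuming a POSIX shell; the function name `count_walsender_timeouts` is illustrative and not part of the appliance.

```shell
# Count walsender timeout terminations in a PostgreSQL log read from stdin.
# The match string is the message text from the log excerpt above.
count_walsender_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function usable in scripts that run under "set -e".
    grep -c 'terminating walsender process due to replication timeout' || true
}
```

It could be called as `count_walsender_timeouts < postgresql.log`, with the path adjusted to wherever the appliance keeps its PostgreSQL log.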

Version-Release number of selected component (if applicable):
5.6.0.9-rc2

How reproducible:
Always

Steps to Reproduce:
1. Set up pglogical replication
2. `ifdown eth0` on the global database server
3. Wait 60s
4. `ifup eth0` on the global database server
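
After step 4, the state of the subscription can be inspected on the global (subscriber) database. The sketch below only prints a query against pglogical's `show_subscription_status()` function; the wrapper name `subscription_status_sql` is illustrative, and the database name in the usage comment is taken from the log excerpt above rather than confirmed for the global server.

```shell
# Emit a query that reports the state of every pglogical subscription.
# Pipe the output to psql on the global (subscriber) database.
subscription_status_sql() {
    printf '%s\n' 'SELECT subscription_name, status, slot_name FROM pglogical.show_subscription_status();'
}

# Example (connection options are assumptions, adjust as needed):
#   subscription_status_sql | psql -d vmdb_production
```

A subscription affected by this bug would be expected to report a status other than `replicating` after the network is back up.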

Actual results:
Replication does not resume when the connection is restored.

Expected results:
Replication resumes once the connection is restored.

Additional info:

Disabling and then re-enabling the pglogical subscription associated with the remote region restarts replication once the connection is restored, but there is no way to do this through the UI.
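
The manual workaround can be sketched with pglogical's `alter_subscription_disable` and `alter_subscription_enable` functions. The helper name `restart_subscription_sql` and the subscription name in the usage comment are hypothetical; the real subscription name has to be looked up on the global database first.

```shell
# Emit SQL that disables and then re-enables one pglogical subscription.
# $1: subscription name (not validated or escaped; pass a trusted value).
restart_subscription_sql() {
    sub=$1
    # The second argument (immediate = true) applies the change right away
    # instead of waiting for the end of the current transaction.
    printf "SELECT pglogical.alter_subscription_disable('%s', true);\n" "$sub"
    printf "SELECT pglogical.alter_subscription_enable('%s', true);\n" "$sub"
}

# Example, run on the global database server (names are assumptions):
#   restart_subscription_sql my_subscription | psql -d vmdb_production
```

Piping the statements to psql runs each one in its own transaction, which matters because the immediate form is not meant to be used inside a transaction block.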

This behavior seems to be controlled by the wal_sender_timeout parameter in postgresql.conf (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT)

Setting this to 0 would disable the disconnect behavior and allow replication to continue when the connection is restored. However, if the global server were lost for good, the regional database server would accumulate WAL until the replication slot was removed manually. The setting would also affect regular streaming replication (for HA purposes).
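
In postgresql.conf this amounts to a line such as `wal_sender_timeout = 0` (the PostgreSQL default is 60 seconds, and 0 disables the timeout). To see what a given configuration file actually sets, something like the following could be used. It is a sketch that assumes the plain `name = value` form; the function name `wal_sender_timeout_setting` is illustrative.

```shell
# Print the last uncommented wal_sender_timeout value found in a
# postgresql.conf read from stdin. Prints nothing if the parameter is
# not set, in which case the server default applies.
wal_sender_timeout_setting() {
    awk -F'[=#]' '
        /^[[:space:]]*wal_sender_timeout[[:space:]]*=/ {
            gsub(/[[:space:]]/, "", $2)   # strip spaces around the value
            v = $2                        # later lines override earlier ones
        }
        END { if (v != "") print v }
    '
}

# Example (path inferred from the appliance template location, adjust if needed):
#   wal_sender_timeout_setting < /var/opt/rh/rh-postgresql94/lib/pgsql/data/postgresql.conf
```

The value the running server is using can be confirmed separately with `SHOW wal_sender_timeout;` in psql.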

Comment 2 Nick Carboni 2016-06-02 19:13:19 UTC

Also, even if the WAL sender times out, the regional server keeps accumulating WAL for the replication slot. We would still need to remove the replication slot if the global server goes away for good.
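
For reference, slot retention can be inspected and a stale slot dropped on the regional (provider) database roughly as follows. This is a sketch that uses the PostgreSQL 9.4 function names (they were renamed in PostgreSQL 10); the helper names and the slot name in the usage comment are hypothetical.

```shell
# Emit a query listing each replication slot and how much WAL it retains.
slot_retention_sql() {
    cat <<'SQL'
SELECT slot_name, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
SQL
}

# Emit SQL that drops one slot by name. PostgreSQL refuses to drop a slot
# that is still active. $1 is not escaped; pass a trusted value.
drop_slot_sql() {
    printf "SELECT pg_drop_replication_slot('%s');\n" "$1"
}

# Example, run on the regional database server (names are assumptions):
#   slot_retention_sql | psql -d vmdb_production
#   drop_slot_sql my_slot | psql -d vmdb_production
```

Dropping the slot releases the retained WAL, but the global server can then no longer resume from that slot, so it only makes sense once the subscriber is gone for good.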

Given that, there seems to be no downside to disabling the timeout, so I'll make that change for now. It will still be configurable if we run into any issues with it.

Comment 3 Nick Carboni 2016-06-02 19:22:03 UTC

https://github.com/ManageIQ/manageiq-appliance/pull/74

Comment 4 CFME Bot 2016-06-03 19:17:26 UTC

New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/cc675e11f71dfa4bbf26952094cdf1ee91c7c532

commit cc675e11f71dfa4bbf26952094cdf1ee91c7c532
Author:     Nick Carboni <ncarboni>
AuthorDate: Thu Jun 2 15:17:26 2016 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jun 3 14:53:28 2016 -0400

    Disable WAL sender timeout behavior
    
    The default behavior is to disable replication (the wal sender)
    after the destination is unreachable for 60 seconds
    
    This behavior is controlled by the wal_sender_timeout parameter
    in postgresql.conf
    (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT)
    
    Setting this to 0 would disable the disconnect behavior
    and allow replication to continue when the connection was restored
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1342255

 TEMPLATE/var/opt/rh/rh-postgresql94/lib/pgsql/data/postgresql.conf.erb | 1 +
 1 file changed, 1 insertion(+)