Description of problem:
If the connection between the remote and global databases is lost for longer than 60 seconds, replication stops and does not continue when the connection is restored.

After 60 seconds the postgresql.log on the regional database server shows the following:

2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG: terminating walsender process due to replication timeout
2016-06-02 18:17:17 GMT:10.8.99.231(48034):5750753c.a53:root@vmdb_production:[2643]:LOG: disconnection: session time: 0:12:33.127 user=root database=vmdb_production host=10.8.99.231 port=48034

Version-Release number of selected component (if applicable):
5.6.0.9-rc2

How reproducible:
Always

Steps to Reproduce:
1. Set up pglogical replication
2. `ifdown eth0` on the global database server
3. Wait 60s
4. `ifup eth0` on the global database server

Actual results:
Replication is not continued when the connection is restored

Expected results:
Replication would continue

Additional info:
Disabling then enabling the pglogical subscription associated with the remote region will restart replication when the connection is restored, but there is no way to do this through the UI.

This behavior seems to be controlled by the wal_sender_timeout parameter in postgresql.conf (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT).

Setting this to 0 would disable the disconnect behavior and allow replication to continue when the connection was restored, but would also cause the regional database server to accumulate WAL logs until the replication slot was removed manually in the case of the global server being lost for good. This would also affect regular streaming replication (for HA purposes).
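
The disable/enable workaround described under Additional info can be sketched in SQL, run against the global database, where the subscription is defined. This is a minimal sketch that assumes pglogical's documented subscription management functions; the subscription name `region_1_subscription` is a hypothetical placeholder, so look up the real name first.

```sql
-- List the subscriptions on this node and their current status.
SELECT subscription_name, status
FROM pglogical.show_subscription_status();

-- 'region_1_subscription' is a placeholder; substitute a name returned above.
-- The second argument (immediate = true) applies the change right away
-- instead of at the end of the current transaction.
SELECT pglogical.alter_subscription_disable('region_1_subscription', true);
SELECT pglogical.alter_subscription_enable('region_1_subscription', true);
```

Re-running the first query afterwards should show whether the subscription has resumed replicating.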
Also, even if the WAL sender times out, the regional server doesn't stop accumulating WAL logs. We would still need to remove the replication slot if the global server goes away for real. Given that, there seems to be no downside to disabling the timeout, so I'll make that change for now. It will still be configurable if we run into any issues with it.
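
For the case where the global server really is gone, the manual slot cleanup mentioned above might look like the following. This is a sketch using the standard PostgreSQL catalog view and function; the slot name `pgl_example_slot` is a hypothetical placeholder, not the name pglogical actually generates.

```sql
-- Run on the regional (provider) database server.
-- An inactive slot keeps WAL from being recycled past its restart_lsn.
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots;

-- 'pgl_example_slot' is a placeholder; use a slot_name from the query above.
-- A slot can only be dropped while it is inactive, and dropping it means the
-- subscriber cannot resume from that slot afterwards.
SELECT pg_drop_replication_slot('pgl_example_slot');
```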
https://github.com/ManageIQ/manageiq-appliance/pull/74
New commit detected on ManageIQ/manageiq-appliance/master:
https://github.com/ManageIQ/manageiq-appliance/commit/cc675e11f71dfa4bbf26952094cdf1ee91c7c532

commit cc675e11f71dfa4bbf26952094cdf1ee91c7c532
Author:     Nick Carboni <ncarboni>
AuthorDate: Thu Jun 2 15:17:26 2016 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jun 3 14:53:28 2016 -0400

    Disable WAL sender timeout behavior

    The default behavior is to disable replication (the wal sender) after
    the destination is unreachable for 60 seconds

    This behavior is controlled by the wal_sender_timeout parameter in postgresql.conf
    (https://www.postgresql.org/docs/current/static/runtime-config-replication.html#GUC-WAL-SENDER-TIMEOUT)

    Setting this to 0 would disable the disconnect behavior and allow replication
    to continue when the connection was restored

    https://bugzilla.redhat.com/show_bug.cgi?id=1342255

 TEMPLATE/var/opt/rh/rh-postgresql94/lib/pgsql/data/postgresql.conf.erb | 1 +
 1 file changed, 1 insertion(+)
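
One way to confirm the setting on an appliance built with this change is to query the regional database server directly. This is a sketch using standard PostgreSQL commands only; it does not reproduce the exact line added to the template.

```sql
-- Run on the regional database server.
-- Shows the value currently in effect; 0 means the timeout is disabled.
SHOW wal_sender_timeout;

-- Lists the WAL sender connections that are currently active, which can be
-- checked again after repeating the ifdown/ifup steps from the description.
SELECT application_name, client_addr, state
FROM pg_stat_replication;
```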