Description of problem:
On the top/master region of a large CFME deployment, when you go to Region 99 -> Replication to see the list of child databases, if one of them is down, the page does not open and returns a "502 Proxy Error".

This is what shows up in the logs:

2018-01-08 20:38:05 GMT::5a53d68f.c7b4:@:[51124]:ERROR: could not connect to the postgresql server in replication mode: timeout expired
2018-01-08 20:38:05 GMT::5a53d68f.c7b4:@:[51124]:DETAIL: dsn was: fallback_application_name='/var/www/miq/vmdb/lib/workers/bin/evm_server.rb' dbname='vmdb_production' host='<ipaddr>' user='root' password='<password>' port='5432'
2018-01-08 20:38:05 GMT::5a53d68f.c7b4:@:[51124]:LOG: apply worker [51124] at slot 4 generation 42479 crashed
2018-01-08 20:38:05 GMT::59d69ddd.d58e:@:[54670]:LOG: worker process: pglogical apply 16386:2117093528 (PID 51124) exited with exit code 1
2018-01-08 20:38:05 GMT::5a53d6ad.c7cc:@:[51148]:LOG: starting apply for subscription region_3_subscription
2018-01-08 20:38:06 GMT::5a53d6ad.c7cc:@:[51148]:ERROR: data stream ended
2018-01-08 20:38:06 GMT::5a53d6ad.c7cc:@:[51148]:LOG: apply worker [51148] at slot 3 generation 32821 crashed
2018-01-08 20:38:06 GMT::59d69ddd.d58e:@:[54670]:LOG: worker process: pglogical apply 16386:2404866424 (PID 51148) exited with exit code 1
2018-01-08 20:38:12 GMT::5a53d6b4.c7ce:@:[51150]:LOG: starting apply for subscription region_3_subscription
2018-01-08 20:38:13 GMT::5a53d6b5.c7d0:@:[51152]:LOG: starting apply for subscription region_19_subscription
2018-01-08 20:38:13 GMT::5a53d6b4.c7ce:@:[51150]:ERROR: data stream ended
2018-01-08 20:38:13 GMT::5a53d6b4.c7ce:@:[51150]:LOG: apply worker [51150] at slot 3 generation 32822 crashed
2018-01-08 20:38:13 GMT::59d69ddd.d58e:@:[54670]:LOG: worker process: pglogical apply 16386:2404866424 (PID 51150) exited with exit code 1

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Here's the thing: the "master" appliance (Region 99), the one that concentrates all the data, is working just fine. The problem appears when I purposely shut down one of the "lower" appliances (Region 19) and then go to the replication screen on the "master"; that's when it does not work. If I bring the lower region appliance back up, it works again.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The replication screen on the master appliance needs to be more forgiving with unreachable databases. That's the problem.

Additional info:
https://github.com/ManageIQ/manageiq/pull/16889
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f40c04332298912c3e4e93036c3725636a3d3759

commit f40c04332298912c3e4e93036c3725636a3d3759
Author:     Yuri Rudman <yrudman>
AuthorDate: Thu Jan 25 14:04:13 2018 -0500
Commit:     Yuri Rudman <yrudman>
CommitDate: Thu Jan 25 15:21:33 2018 -0500

    rescue attempt to get backlog from remote server, it will allow to manage subscription screen even if remote db is offline
    https://bugzilla.redhat.com/show_bug.cgi?id=1533958

 app/models/pglogical_subscription.rb | 3 +++
 1 file changed, 3 insertions(+)
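
For readers who do not want to open the diff, the idea in the commit message (rescue the backlog lookup so one offline remote database does not break the whole screen) can be sketched roughly as follows. This is a minimal illustration, not the actual change to app/models/pglogical_subscription.rb: the class SubscriptionSketch, the error class RemoteUnavailable, and the helper replication_rows are hypothetical names invented for this example, and the real code may rescue a different exception and report it differently.

```ruby
# Hypothetical stand-in for whatever error is raised when a remote
# region's database cannot be reached (an assumption for this sketch).
class RemoteUnavailable < StandardError; end

# Hypothetical subscription object; NOT the real PglogicalSubscription model.
class SubscriptionSketch
  attr_reader :name

  # The block simulates the remote query that computes the backlog.
  def initialize(name, &backlog_fetcher)
    @name = name
    @backlog_fetcher = backlog_fetcher
  end

  # Returns the backlog, or nil when the remote database is unreachable,
  # so that a single offline region does not abort the whole listing.
  def backlog
    @backlog_fetcher.call
  rescue RemoteUnavailable => e
    warn("Could not fetch backlog for #{name}: #{e.message}")
    nil
  end
end

# Builds the rows a replication screen might display (hypothetical helper).
def replication_rows(subscriptions)
  subscriptions.map { |s| {:name => s.name, :backlog => s.backlog} }
end

subs = [
  SubscriptionSketch.new("region_3_subscription") { 1024 },
  SubscriptionSketch.new("region_19_subscription") { raise RemoteUnavailable, "timeout expired" }
]
rows = replication_rows(subs)
```

In this sketch the offline region still gets a row, with a nil backlog, while the reachable region reports its value, which matches the behavior asked for under "Expected results".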
Verified on 5.10.0.2.