Description of problem:
The replication tab presents a 502 proxy error when the remote site is down. Two previous bugs [1] addressing this problem have already been fixed.

Long running requests starting:

[----] I, [2020-01-29T18:42:37.811088 #3108:1148f60] INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:heartbeat=>0.011402606964111328, :server_dequeue=>0.005303382873535156, :worker_monitor=>1.0663485527038574, :worker_dequeue=>0.008280754089355469, :total_time=>1.0916004180908203}
[----] W, [2020-01-29T18:42:38.224015 #3574:130bfa0] WARN -- : MIQ(MiqUiWorker::Runner#log_long_running_requests) Long running http(s) request: '/ops/pglogical_subscriptions_form_fields/new' handled by #3574:4d4cafc, running for 121.77 seconds
[----] I, [2020-01-29T18:42:43.225464 #3427:1148f60] INFO -- : MIQ(MiqScheduleWorker::Runner#do_work) Number of scheduled items to be processed: 0.
[----] I, [2020-01-29T18:42:53.934434 #3108:1148f60] INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.004981517791748047, :worker_monitor=>1.0818428993225098, :worker_dequeue=>0.006698131561279297, :total_time=>1.0937509536743164}
[----] E, [2020-01-29T18:42:54.370781 #3574:4d4cafc] ERROR -- : MIQ(PglogicalSubscription#backlog) could not connect to server: Connection timed out
    Is the server running on host "200.100.90.212" and accepting
    TCP/IP connections on port 5432?

[1] Original issues:
https://bugzilla.redhat.com/show_bug.cgi?id=1759511
https://bugzilla.redhat.com/show_bug.cgi?id=1741240

Version-Release number of selected component (if applicable):
5.10.14
It looks like the host at 200.100.90.212 is not reachable, and opening a socket to an unreachable host can sometimes take over a minute to time out. It also seems that the subscriptions page is being loaded between the time the host goes away and the time the postgres server recognizes it. Once postgres detects that the subscription is bad, it marks it as such and the subscriptions page no longer attempts to reach the remote to get the current backlog. It would be good to see the postgres log from the global region that corresponds to this request.
https://bugzilla.redhat.com/show_bug.cgi?id=1759511 is not the same issue this customer is seeing. That bug was caused by a nil reference, which is not present in these logs. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1741240 was backported to 5.10 in 5.10.10 (ref: https://bugzilla.redhat.com/show_bug.cgi?id=1749052), so they already have that fix. This seems to be either a separate issue or just slowness in the environment. I'm trying to reproduce the issue and come up with some steps to debug what's going on in this case. Until then, I would recommend removing the subscriptions to the removed remote regions so that the page becomes responsive again. If the remote region needs to be down for some time, I would suggest not adding the subscription until the remote region is expected to be stable.
I was able to reproduce this issue only when uncleanly shutting down the remote server (force power off, i.e. "pull the plug"). In that case the subscription to that region continued to report its status as "replicating", which caused us to keep trying to query the remote database for the backlog calculation. In all other cases (shutting down postgresql on the remote server, or cleanly shutting down the remote VM) the subscription reported as "down" and the page remained responsive. As a solution, we could wrap the contents of the backlog method in some reasonable timeout (5 seconds maybe?), which should keep the subscription page responsive even if the remote servers are shut down in some bad way. Alternatively, we could remove the backlog reporting from the subscription page entirely and add it to the kebab menu as a separate on-demand query. The first option is probably a quicker fix but a bit more fragile (someone could have connections that are up but take longer than the threshold), while the second option takes useful data away from users in environments that are typically responsive. Thoughts?
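To make the first option concrete, here is a minimal sketch of wrapping a slow call in a timeout using Ruby's standard Timeout module. The helper name `backlog_or_nil` and the 5 second default are assumptions for illustration only, not existing ManageIQ code. Note that the commit message later in this thread reports that plain Timeout did not interrupt the PG connection attempt, so treat this as an illustration of the idea rather than a working fix.

```ruby
require "timeout"

# Hypothetical helper (not part of ManageIQ): runs the given block and
# returns its value, or nil if the block takes longer than `seconds`.
def backlog_or_nil(seconds = 5)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  # Treat a timed-out backlog query as "unknown" so the page can still render.
  nil
end

# A fast block returns its value; a slow block is cut off and yields nil.
fast = backlog_or_nil(1) { 42 }
slow = backlog_or_nil(0.05) { sleep(1); 42 }
puts "fast=#{fast.inspect} slow=#{slow.inspect}"
```

The fragility mentioned above shows up directly here: any healthy remote that takes longer than the threshold would also come back as nil.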
https://github.com/ManageIQ/manageiq/pull/19791
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/8098b63460520d4087f769d174d7c847723573cb

commit 8098b63460520d4087f769d174d7c847723573cb
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Jan 31 15:09:41 2020 -0500
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jan 31 15:09:41 2020 -0500

    Add a connection timeout for remote region connections

    This will prevent a non-responsive remote region from hanging the UI when trying to query for the replication backlog. Regular ruby Timeout won't work here because the PG connection code doesn't respond to the exception until after it has exhausted the underlying libpq timeout logic.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1796681

 app/models/miq_region_remote.rb      | 15 +-
 app/models/pglogical_subscription.rb |  6 +-
 2 files changed, 11 insertions(+), 10 deletions(-)
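For readers who want to see what a connection-level timeout can look like, here is a rough sketch that relies on libpq's `connect_timeout` connection parameter (in seconds). This is not the code from the commit above. The helper name, the 5 second default, and the host, database, and user values are all placeholders assumed for illustration.

```ruby
# Assumed default, in seconds; the value used by the actual commit is not shown in this thread.
DEFAULT_CONNECT_TIMEOUT = 5

# Hypothetical helper (not ManageIQ code): builds a connection parameter hash
# that includes libpq's connect_timeout, so the connection attempt itself
# gives up instead of waiting for the OS-level TCP timeout.
def remote_connection_params(host:, dbname:, user:, port: 5432, connect_timeout: DEFAULT_CONNECT_TIMEOUT)
  {
    :host            => host,
    :port            => port,
    :dbname          => dbname,
    :user            => user,
    :connect_timeout => connect_timeout
  }
end

params = remote_connection_params(:host => "remote.example.com", :dbname => "vmdb_production", :user => "root")
puts params.inspect

# With the pg gem installed, the hash could be passed straight to the driver:
#   require "pg"
#   conn = PG.connect(params)
```

Because the limit is enforced inside libpq rather than by a Ruby-level exception, it avoids the problem the commit message describes with regular Timeout.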