1796681 – Replication tab presenting 502 proxy error when remote site is down

Summary: Replication tab presenting 502 proxy error when remote site is down

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Replication
Sub Component:
Version:	5.10.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.12.0
Assignee:	Nick Carboni
QA Contact:	Tasos Papaioannou
Docs Contact:	Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:	1798523 1798526

Reported:	2020-01-30 23:09 UTC by Jared Deubel
Modified:	2023-09-07 21:39 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1798523 1798526
Environment:
Last Closed:	2020-06-10 13:13:41 UTC
Category:	Bug
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:
Flags:	mfeifer: mirror+

Description Jared Deubel 2020-01-30 23:09:45 UTC

Description of problem:

The replication tab presents a 502 proxy error when the remote site is down. Two previous issues [1] that addressed this problem have already been fixed.


Long running requests starting:
[----] I, [2020-01-29T18:42:37.811088 #3108:1148f60]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:heartbeat=>0.011402606964111328, :server_dequeue=>0.005303382873535156, :worker_monitor=>1.0663485527038574, :worker_dequeue=>0.008280754089355469, :total_time=>1.0916004180908203}
[----] W, [2020-01-29T18:42:38.224015 #3574:130bfa0]  WARN -- : MIQ(MiqUiWorker::Runner#log_long_running_requests) Long running http(s) request: '/ops/pglogical_subscriptions_form_fields/new' handled by #3574:4d4cafc, running for 121.77 seconds
[----] I, [2020-01-29T18:42:43.225464 #3427:1148f60]  INFO -- : MIQ(MiqScheduleWorker::Runner#do_work) Number of scheduled items to be processed: 0.
[----] I, [2020-01-29T18:42:53.934434 #3108:1148f60]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.004981517791748047, :worker_monitor=>1.0818428993225098, :worker_dequeue=>0.006698131561279297, :total_time=>1.0937509536743164}
[----] E, [2020-01-29T18:42:54.370781 #3574:4d4cafc] ERROR -- : MIQ(PglogicalSubscription#backlog) could not connect to server: Connection timed out
	Is the server running on host "200.100.90.212" and accepting
	TCP/IP connections on port 5432?

[1] Original issues:
https://bugzilla.redhat.com/show_bug.cgi?id=1759511
https://bugzilla.redhat.com/show_bug.cgi?id=1741240


Version-Release number of selected component (if applicable):
5.10.14

Comment 2 Gregg Tanzillo 2020-01-31 14:51:13 UTC

It looks like the host at 200.100.90.212 is not reachable, and opening a socket to an unreachable host can sometimes take over a minute to time out. It also seems that the subscriptions page is being loaded between the time the host goes away and the time the postgres server recognizes it: once postgres detects that the subscription is bad, it marks it as such and the subscriptions page no longer attempts to reach the remote to get the current backlog.

It would be good to see the postgres log from the global region that corresponds to this request.
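
To illustrate the point about slow socket opens, here is a minimal sketch of a bounded reachability check using Ruby's standard `Socket.tcp` with its `connect_timeout` option. The helper name `remote_reachable?` and the 5-second default are assumptions made for this example; nothing like it is confirmed to exist in ManageIQ.

```ruby
require "socket"

# Hypothetical helper (not part of ManageIQ): returns true if a TCP
# connection to host:port can be opened within `timeout` seconds,
# and false if the attempt times out, is refused, or cannot resolve.
def remote_reachable?(host, port = 5432, timeout: 5)
  # With a block, Socket.tcp closes the socket afterwards and
  # returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  # SystemCallError covers the Errno::* family (timed out, refused,
  # host unreachable); SocketError covers name resolution failures.
  false
end
```

A caller could use a check like this to skip the backlog query for a remote that does not answer quickly, rather than waiting for the operating system's default connect timeout.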

Comment 3 Nick Carboni 2020-01-31 16:57:38 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1759511 is not the same issue as this customer's; that bug was caused by a nil reference, which is not present in these logs.
https://bugzilla.redhat.com/show_bug.cgi?id=1741240 was backported to 5.10 (ref: https://bugzilla.redhat.com/show_bug.cgi?id=1749052) in 5.10.10, so the customer already has that fix.

This seems to be either a separate issue or just slowness in the environment.
I'm trying to reproduce this issue and come up with some steps to debug what's going on in this case.

Until that happens I would recommend removing the subscriptions to the removed remote regions so that the page becomes responsive again.
If it's necessary for the remote region to be down for some time I would suggest not adding the subscription until the remote region is expected to be stable.

Comment 5 Nick Carboni 2020-01-31 17:38:34 UTC

I was able to reproduce this issue only when uncleanly shutting down the remote server (force power off, "pull the plug"). In that case the subscription to that region continued to report its status as "replicating", which caused us to continue to attempt to query the remote database for the backlog calculation. In all other cases (shutting down postgresql on the remote server, and cleanly shutting down the remote VM) the subscription reported as "down" and the page remained responsive.

As a solution to this we could wrap the contents of the backlog method in some reasonable timeout (5 seconds maybe?) which should keep the subscription page responsive even if the remote servers are shut down in some bad way.
Alternatively we could remove the backlog reporting from the subscription page entirely and add it to the kebab menu as a separate query on demand.

The first option is probably the quicker fix but is a bit more fragile (someone could have connections that are up but take longer than the threshold), while the second option takes useful data away from users in environments that are typically responsive.
Thoughts?
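
For reference, the first option could be sketched roughly as below, using Ruby's standard `Timeout` module. The wrapper name `with_backlog_timeout` and its `fallback` parameter are hypothetical and not taken from the ManageIQ code. As the commit message later in this bug notes, this approach does not help when the time is spent inside the PG connection code, so this is an illustration of the idea and not the fix that was merged.

```ruby
require "timeout"

# Hypothetical wrapper (not ManageIQ code): runs the block and returns
# its value, or returns `fallback` if the block takes longer than
# `seconds`. Timeout interrupts Ruby-level code; per the commit message
# in this bug, the PG connection code does not respond to it until the
# underlying libpq timeout logic has been exhausted.
def with_backlog_timeout(seconds = 5, fallback: nil)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  fallback
end
```

Example use: `with_backlog_timeout(5) { subscription.backlog }`, where `subscription` stands in for whatever object provides the backlog. A `nil` result would then mean "backlog unknown" and the page could render without it.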

Comment 6 CFME Bot 2020-01-31 23:46:04 UTC

https://github.com/ManageIQ/manageiq/pull/19791

Comment 7 CFME Bot 2020-02-04 21:10:46 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/8098b63460520d4087f769d174d7c847723573cb
commit 8098b63460520d4087f769d174d7c847723573cb
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Jan 31 15:09:41 2020 -0500
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jan 31 15:09:41 2020 -0500

    Add a connection timeout for remote region connections

    This will prevent a non-responsive remote region from hanging
    the UI when trying to query for the replication backlog.

    Regular ruby Timeout won't work here because the PG connection code
    doesn't respond to the exception until after it has exhausted the
    underlying libpq timeout logic.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1796681

 app/models/miq_region_remote.rb | 15 +-
 app/models/pglogical_subscription.rb | 6 +-
 2 files changed, 11 insertions(+), 10 deletions(-)
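
The commit's diff is not reproduced in this bug. As a rough sketch of the general technique it describes (a connection-level timeout instead of Ruby's `Timeout`), the example below builds connection settings that include libpq's `connect_timeout` parameter, which is given in seconds. The helper name, its arguments, and the 5-second default are assumptions for illustration only; they are not the contents of `miq_region_remote.rb`.

```ruby
# Assumed default, in seconds; the value echoes the "5 seconds maybe?"
# suggestion in the discussion and is not taken from the actual commit.
DEFAULT_CONNECT_TIMEOUT = 5

# Hypothetical helper (not the ManageIQ implementation): builds a hash
# of connection settings for a remote region's database, including
# libpq's connect_timeout so a dead host fails fast at connect time.
def remote_connection_params(host:, dbname:, user:, port: 5432,
                             password: nil,
                             connect_timeout: DEFAULT_CONNECT_TIMEOUT)
  {
    host:            host,
    port:            port,
    dbname:          dbname,
    user:            user,
    password:        password,
    connect_timeout: connect_timeout
  }.compact # drop nil entries such as an unset password
end
```

A hash of this shape could be passed to a PostgreSQL client that forwards options to libpq, for example `PG.connect(params)` from the pg gem, so that the timeout is enforced by libpq itself.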