Bug 1796681

Summary: Replication tab presenting 502 proxy error when remote site is down
Product: Red Hat CloudForms Management Engine
Reporter: Jared Deubel <jdeubel>
Component: Replication
Assignee: Nick Carboni <ncarboni>
Status: CLOSED NOTABUG
QA Contact: Tasos Papaioannou <tpapaioa>
Severity: high
Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Priority: high
Version: 5.10.14
CC: dmetzger, gtanzill, jocarter, mshriver, obarenbo
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.12.0
Flags: mfeifer: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1798523 1798526
Environment:
Last Closed: 2020-06-10 13:13:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1798523, 1798526    

Description Jared Deubel 2020-01-30 23:09:45 UTC
Description of problem:

The Replication tab returns a 502 proxy error when the remote site is down. Two earlier bugs [1] that addressed this problem have already been fixed.


Long running requests starting:
[----] I, [2020-01-29T18:42:37.811088 #3108:1148f60]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:heartbeat=>0.011402606964111328, :server_dequeue=>0.005303382873535156, :worker_monitor=>1.0663485527038574, :worker_dequeue=>0.008280754089355469, :total_time=>1.0916004180908203}
[----] W, [2020-01-29T18:42:38.224015 #3574:130bfa0]  WARN -- : MIQ(MiqUiWorker::Runner#log_long_running_requests) Long running http(s) request: '/ops/pglogical_subscriptions_form_fields/new' handled by #3574:4d4cafc, running for 121.77 seconds
[----] I, [2020-01-29T18:42:43.225464 #3427:1148f60]  INFO -- : MIQ(MiqScheduleWorker::Runner#do_work) Number of scheduled items to be processed: 0.
[----] I, [2020-01-29T18:42:53.934434 #3108:1148f60]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.004981517791748047, :worker_monitor=>1.0818428993225098, :worker_dequeue=>0.006698131561279297, :total_time=>1.0937509536743164}
[----] E, [2020-01-29T18:42:54.370781 #3574:4d4cafc] ERROR -- : MIQ(PglogicalSubscription#backlog) could not connect to server: Connection timed out
	Is the server running on host "200.100.90.212" and accepting
	TCP/IP connections on port 5432?

[1] Original issues:
https://bugzilla.redhat.com/show_bug.cgi?id=1759511
https://bugzilla.redhat.com/show_bug.cgi?id=1741240


Version-Release number of selected component (if applicable):
5.10.14

Comment 2 Gregg Tanzillo 2020-01-31 14:51:13 UTC
It looks like the host at 200.100.90.212 is not reachable, and opening a socket to an unreachable host can sometimes take over a minute to time out. It also seems that the subscriptions page is being loaded in the window between the host going away and the postgres server recognizing it. Once postgres detects that the subscription is bad, it marks it as such and the subscriptions page no longer attempts to reach the remote to get the current backlog.

It would be good to see the postgres log from the global region that corresponds to this request.
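
To illustrate the point about slow socket timeouts, here is a minimal sketch of a connection probe with an explicit upper bound. It is not ManageIQ code: `remote_db_reachable?` and its defaults are hypothetical, and it relies only on `Socket.tcp` from the Ruby standard library.

```ruby
require "socket"

# Hypothetical helper (not part of ManageIQ): returns true if a TCP
# connection to host:port can be opened within timeout_seconds.
# Without connect_timeout, the attempt falls back to the operating
# system's TCP connect timeout, which can exceed a minute.
def remote_db_reachable?(host, port = 5432, timeout_seconds = 5)
  # With a block, Socket.tcp closes the socket and returns the block value.
  Socket.tcp(host, port, connect_timeout: timeout_seconds) { true }
rescue SystemCallError, SocketError, IOError
  # Covers timeouts, refused connections, and unresolvable hosts.
  false
end
```

A check like this only says whether the port accepts connections; it says nothing about whether the subscription itself is healthy.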

Comment 3 Nick Carboni 2020-01-31 16:57:38 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1759511 is not the same issue this customer is seeing; that bug was caused by a nil reference, which is not present in these logs.
https://bugzilla.redhat.com/show_bug.cgi?id=1741240 was backported to 5.10 (ref: https://bugzilla.redhat.com/show_bug.cgi?id=1749052) in 5.10.10, so they have that fix.

This seems to be either a separate issue or just slowness in the environment.
I'm working on trying to reproduce this issue and come up with some steps to debug what's going on in this case.

Until that happens, I would recommend removing the subscriptions to the removed remote regions so that the page becomes responsive again.
If the remote region needs to be down for some time, I would suggest not adding the subscription until the remote region is expected to be stable.

Comment 5 Nick Carboni 2020-01-31 17:38:34 UTC
I was able to reproduce this issue only when uncleanly shutting down the remote server (force power off, i.e. "pull the plug"). In that case the subscription to that region continued to report its status as "replicating", which caused us to keep querying the remote database for the backlog calculation. In all other cases (shutting down postgresql on the remote server, or cleanly shutting down the remote VM) the subscription reported as "down" and the page remained responsive.

As a solution to this we could wrap the contents of the backlog method in some reasonable timeout (5 seconds maybe?) which should keep the subscription page responsive even if the remote servers are shut down in some bad way.
Alternatively we could remove the backlog reporting from the subscription page entirely and add it to the kebab menu as a separate query on demand.

The first option is probably the quicker fix but is a bit more fragile (a connection that is up but slower than the threshold would be cut off), while the second option removes useful data from users in environments that are typically responsive.
Thoughts?
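
The behavior described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the actual `PglogicalSubscription` code: the `Subscription` struct, `backlog_for`, and the `fetch_backlog` callable are all hypothetical names.

```ruby
# Hypothetical stand-in for a subscription record; not the ManageIQ model.
Subscription = Struct.new(:id, :status)

# Returns the backlog, or nil when it cannot be determined.
# fetch_backlog is any callable that performs the remote query.
def backlog_for(subscription, fetch_backlog)
  # A subscription already marked "down" never touches the remote database.
  return nil unless subscription.status == "replicating"

  fetch_backlog.call(subscription)
rescue StandardError => err
  # A failed remote query is reported as an unknown backlog.
  warn("backlog unavailable for #{subscription.id}: #{err.message}")
  nil
end
```

Note that nothing here limits how long `fetch_backlog` may block. After an unclean power-off the status stays "replicating", so the call is still made and the page waits for the full connection timeout, which is the gap the two options above are meant to close.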

Comment 7 CFME Bot 2020-02-04 21:10:46 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/8098b63460520d4087f769d174d7c847723573cb
commit 8098b63460520d4087f769d174d7c847723573cb
Author:     Nick Carboni <ncarboni>
AuthorDate: Fri Jan 31 15:09:41 2020 -0500
Commit:     Nick Carboni <ncarboni>
CommitDate: Fri Jan 31 15:09:41 2020 -0500

    Add a connection timeout for remote region connections

    This will prevent a non-responsive remote region from hanging
    the UI when trying to query for the replication backlog.

    Regular ruby Timeout won't work here because the PG connection code
    doesn't respond to the exception until after it has exhausted the
    underlying libpq timeout logic.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1796681

 app/models/miq_region_remote.rb | 15 +-
 app/models/pglogical_subscription.rb | 6 +-
 2 files changed, 11 insertions(+), 10 deletions(-)
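
For readers unfamiliar with the libpq-level timeout the commit message refers to, the following is a minimal sketch of the idea and not the contents of the commit. `remote_connection_params` and `DEFAULT_CONNECT_TIMEOUT` are hypothetical names, and the 5-second default is an assumption borrowed from the suggestion in Comment 5. The `connect_timeout` key itself is a standard libpq connection parameter.

```ruby
# Assumed default, in seconds; libpq treats zero or an unset value as
# "wait indefinitely".
DEFAULT_CONNECT_TIMEOUT = 5

# Hypothetical helper: builds connection parameters for a remote region
# database with a bounded connection attempt.
def remote_connection_params(host:, dbname:, user:, port: 5432,
                             connect_timeout: DEFAULT_CONNECT_TIMEOUT)
  {
    :host            => host,
    :port            => port,
    :dbname          => dbname,
    :user            => user,
    # Enforced inside libpq itself, so it takes effect even where a
    # Ruby-level Timeout would not interrupt the connection attempt.
    :connect_timeout => connect_timeout
  }
end
```

A hash of this shape could be passed to `PG.connect` from the `pg` gem. The timeout only bounds connection establishment; a query on an already established connection is not limited by it.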