1595832 – 502 Proxy Error

Bug 1595832 - 502 Proxy Error

Summary: 502 Proxy Error

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Joe Rafaniello
QA Contact:	Tasos Papaioannou
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1609849 (view as bug list)
Depends On:
Blocks:	1600191
TreeView+	depends on / blocked

Reported:	2018-06-27 15:04 UTC by Jared Deubel
Modified:	2021-12-10 16:28 UTC (History)
CC List:	17 users (show)
Fixed In Version:	5.10.0.4
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1600191 (view as bug list)
Environment:
Last Closed:	2019-02-11 14:00:43 UTC
Category:	---
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
proxy error (17.47 KB, image/png) 2018-07-06 16:52 UTC, Daniyal	no flags	Details
screenshot of the proxy error (139.46 KB, image/png) 2018-09-04 10:07 UTC, Avinash Kumar Dasoundhi	no flags	Details
proxy error screenshot 2 (43.41 KB, image/png) 2018-09-05 03:41 UTC, Avinash Kumar Dasoundhi	no flags	Details
View All

Description Jared Deubel 2018-06-27 15:04:16 UTC

Description of problem:

While using the Web UI Server Error (API) popups are being generated. The contents contains Status 502 Proxy Error (see attached screenshot).

We have tried restarting the evmserverd processes, restarting the browser, clearing the browser cache and restarting, but none of these steps have caused the pop-ups to not be generated.

This is impacting a consulting engagement being delivered.

Version-Release number of selected component (if applicable):
5.9.2.4

Comment 2 Joe Rafaniello 2018-06-27 16:24:47 UTC

They applied cfme-5.9.2.4-8.el7cf.x86_64.rpm for the dalli fix (from bug 1578126) and restarted.  We don't know if this error was causing the proxy error but we want to eliminate it as a possibility.  I'm continuing to try to recreate it.

Comment 3 Joe Rafaniello 2018-06-29 16:18:05 UTC

The customer/field engineer has not seen this error reoccur since the dalli fix.  It's unclear if that fix also fixed this 502 error.  We can reopen if this issue resurfaces again.

Comment 11 Daniyal 2018-07-04 16:22:28 UTC

Running into the same thing - I have 2 UI and 2 DB nodes. Seeing the 502 proxy pop up error on UI1. Tailing evm.log while grepping for ERROR doesnt show anything

Comment 12 Gregg Tanzillo 2018-07-05 13:07:58 UTC

Hi daniyal.javed,

We've been looking into this problem this week. There are currently 2 issues in Bugzilla and in both cases the appliance is running on Azure. By any chance, are you also running your appliance on Azure?

Comment 13 Daniyal 2018-07-05 13:28:59 UTC

(In reply to Gregg Tanzillo from comment #12)
> Hi daniyal.javed,
> 
> We've been looking into this problem this week. There are currently 2 issues
> in Bugzilla and in both cases the appliance is running on Azure. By any
> chance, are you also running your appliance on Azure?

Hi Greg, I am running my appliances in VMWare. I only see it pop up on one of my two UI appliances. The pop up reoccurs about every 2 minutes.

Comment 14 Nick Carboni 2018-07-06 13:32:55 UTC

We attempted to reproduce this issue using an azure VM behind a load balancer with embedded ansible enabled and were not able to recreate the 502 pop-up errors.

Comment 15 Daniyal 2018-07-06 16:52:24 UTC

I dug into this a little more. The 502 Proxy error is appearing on URl <cfme-url>/api/auth?requester_type=ws

When I access the above URL from the UI where I see the pop up, it just stays on loading and then gives me a Proxy Error

When I access the above URL from UI2 where I am seeing the pop up, I can an auth token back with an expiry.

Comment 16 Daniyal 2018-07-06 16:52:56 UTC

Created attachment 1457051 [details]
proxy error

proxy error when accessing the API url

Comment 17 Nick Carboni 2018-07-06 16:56:21 UTC

Daniyal,

I'm not sure I understand the difference between "from the UI where I see the pop up" and "from UI2 where I am seeing the pop up". What are the differences between "UI" and "UI2"? Are you using a load balancer in front of two UI servers?

Comment 18 Daniyal 2018-07-06 16:57:54 UTC

Sorry Nick, there is a typo. I have two UIs. UI1 is where the pop up appears and UI2 is where the I am not seeing the pop up. There is an F5 load balancer infront of the two UI nodes

Comment 19 Dávid Halász 2018-07-07 05:29:41 UTC

Daniyal,

are you using memcached for sessions? Is it being shared between the UI nodes?

Thanks

Comment 20 CFME Bot 2018-07-10 19:03:38 UTC

https://github.com/ManageIQ/manageiq-api/pull/416

Comment 21 CFME Bot 2018-07-10 20:54:21 UTC

New commit detected on ManageIQ/manageiq-api/master:

https://github.com/ManageIQ/manageiq-api/commit/abd070af52cebc5a4707fe3c2b1c3db2a350f6c5
commit abd070af52cebc5a4707fe3c2b1c3db2a350f6c5
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Jul 10 13:44:30 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Jul 10 13:44:30 2018 -0400

    Permit concurrent loads to avoid a deadlock

    https://bugzilla.redhat.com/show_bug.cgi?id=1596703
    https://bugzilla.redhat.com/show_bug.cgi?id=1595832

    If two web service worker threads both receive a request and have not yet
    initialized time_attributes, it's possible for the threads to deadlock as
    described below.

    Thread 1:
    constantizes CloudObjectStoreContainer, which tries to include_concern
    "Operations" (note: the constant is not yet loaded)... and waits for a
    lock held by thread 2

    Thread 2:
    constantizes CloudObjectStoreContainer, acquires a lock and sits and
    waits on something else

    It's unclear why thread 2 is holding a lock and will not relinquish it.

    For now, we can protect this constantize by wrapping
    each constantize in this method in a permit_concurrent_loads block.

    Note, several other classes successfully load concurrently, but
    CloudObjectStoreContainer consistently hits this.  It is the first class
    that tries to constantize that also does include_concern so it's
    possible this combination will be responsible for the eventual fix.

    To test this:

    1) Add rails lock debugging middleware to config/application.rb:
       `config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks`

    2) Copy this script to /var/www/miq/vmdb:

    Thread.new do
      `curl -k -L https://admin:smartvm@localhost/api/notifications?expand=resources&attributes=details&sort_by=id&sort_order=desc&limit=100`
    end

    Thread.new do
      `curl -k -L https://admin:smartvm@localhost/api/notifications?expand=resources&attributes=details&sort_by=id&sort_order=desc&limit=100`
    end

    3) `systemctl start evmserverd` (or restart if you added the middleware in
    step 1)

    4) run the script in step 2.  `ruby path/to/file`.

    5) Hit the locks endpoint added in step 1): `curl -L http://admin:smartvm@localhost:4000/rails/locks`

    6) If the locks endpoint shows locks, run it again as it could be a
    temporary lock.  With the fixed code, the curl command should have no
    output.  With the broken code, it outputs the threads, their backtraces
    and who's blocking whom.

 lib/api/environment.rb | 7 +-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comment 24 Tasos Papaioannou 2018-07-17 14:38:44 UTC

Verified on 5.10.0.4.

Comment 25 Cyril Coupel 2018-07-30 14:12:03 UTC

Hi all,
I faced this issue today while upgrading from 5.9.2 to 5.9.3.
I have 2 appliances for UI, the 1st with 4 UI workers and the 2nd with only 1 UI worker to be used as backup.

When using the 1st appliance, I have the issue. When using the 2nd one, I don't have the issue.

I found a workaround by setting the 1st appliance to only 1 UI worker.

As the issue is solved in 5.10.0.4, do you plan to backport it to 5.9.x ?

Comment 26 Joe Rafaniello 2018-07-30 14:52:21 UTC

Yes, that makes sense.  By reducing the number of UI workers to 1, you're greatly decreasing the chance that two requests will be sent to the web service worker at exactly the "right" (perfect timing for the deadlock, bad timing for the user) time.  

Regarding, a fix for 5.9.x, it should be in 5.9.4.0, see bug #1600191.

Comment 27 Joe Rafaniello 2018-08-01 16:32:22 UTC

*** Bug 1609849 has been marked as a duplicate of this bug. ***

Comment 28 Avinash Kumar Dasoundhi 2018-09-04 10:07:02 UTC

Created attachment 1480729 [details]
screenshot of the proxy error

I am running the latest build i.e 5.9.4.7, and I have 1 DB and a bunch of UI workers approx 9. In every appliance, I am facing this issue.

But it is specific to only one place where we add the provider and then check the details of the provider after doing a refresh relationship and power stats. The provider is VMware which has 10k vms in it.

I tried the workaround also given above by setting the appliances to only 1 UI worker still, the popup is coming. 

Attaching the screenshot of the error also, which I am getting as a popup.

Comment 29 Joe Rafaniello 2018-09-04 13:22:44 UTC

Hi Avinash,

The issue you're describing sounds unrelated.  502 proxy error is a generic error and just indicates apache wasn't able to get a response to from our rails processes.  This BZ is specifically regarding the API webserver worker hitting a deadlock on the first request in two different threads.  From what you described, it's long beyond the first request.

Please verify the procedure is reproducible and open a new issue, providing logs (we need evm.log/production.log and apache/ssl* logs for this).

Comment 30 Avinash Kumar Dasoundhi 2018-09-05 03:41:13 UTC

Created attachment 1480961 [details]
proxy error screenshot 2

Hi Joe!

I have attached a second screenshot which is identical to the screenshot attached by Daniyal. When I am trying to access the URL from the browser, I am getting the same proxy error. 

I just want to confirm that my issue is totally different from what is being discussed here. If it is, I will be opening a new BZ with all the logs. Just wanted to understand what is happening in my case because in both scenarios we are having the exact same error.

Comment 31 Joe Rafaniello 2018-09-05 17:50:09 UTC

Hi Avinash,

Yes, all proxy errors look the same or similar from the outside.  Apache handled a request, dispatched it to one of our puma based ruby processes and most likely  took more than 2 minutes to complete it so apache timed it out.

This bug is for a specific class of this problem, where the API worker's first 2 requests happen at nearly the same time and deadlock each other.  If your API workers have handled a few requests, for example you've logged into the UI, and sometime after you get this proxy error, then you're seeing a different problem.

Comment 32 Avinash Kumar Dasoundhi 2018-09-06 04:42:38 UTC

Thanks, Joe for such detailed explanation.

Will open a new BZ with all the logs and screenshots.

Note You need to log in before you can comment on or make changes to this bug.