Bug 1526146 - Apache reverse proxy failing to connect to backend puma servers on a UI node after swapping
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: GA
Target Release: cfme-future
Assignee: Gregg Tanzillo
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks: 1551709
 
Reported: 2017-12-14 20:33 UTC by Ron
Modified: 2023-09-15 00:05 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-08 19:41:27 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:



Description Ron 2017-12-14 20:33:40 UTC
Description of problem:
We have three UI worker nodes and two VMDB appliances in a global region. An HAProxy load balancer sits in front of the three UI workers. We were experiencing high latency when accessing the UI workers and tracked it down to a single UI node (UI01). Upon further investigation, we found that the UI node was swapping. We dropped the UI workers from 9 to 4, which resolved the swapping; however, response time is still high and Apache is throwing 502 errors along with timeouts to puma.

Version-Release number of selected component (if applicable): 5.8


How reproducible:
Our environment is currently experiencing this issue. We have removed UI01 from the load balancer to keep the application online.


Steps to Reproduce:
1. ? - Haven't tracked down the root cause of the issue yet.
2. Logins will timeout/fail.

Actual results:
UI worker node is unresponsive except for the initial login screen, which is served statically from Rails' /public directory.

Expected results:
Rapid response time of web application.

Additional info:
Appliances are running on VMware ESXi 6.5, with 8 vCPU and 24GB of memory.

Comment 2 Gregg Tanzillo 2017-12-15 14:36:43 UTC
Can we get a full set of logs from the UI01 appliance so that we can investigate? Thanks.

Comment 3 Ron 2017-12-15 14:54:37 UTC
Full logs will be too large to upload to BZ. I have uploaded a complete set of logs to the RHN Case 01994192.

Comment 4 Lynn Dixon 2017-12-15 15:01:12 UTC
Linking BZ to support case.

Comment 7 Joe Rafaniello 2017-12-19 17:48:04 UTC
We'd like to continue to try to track down this issue to see if it reoccurs,
but also provide some information so you can capture the diagnostics and fix the
problem quickly.

If it's possible, please reboot the appliance that had the proxy errors.

Monitor the system for any 502 or 503 errors users receive in the user
interface.

If those occur:

1) ssh into the problematic appliance

2) Run 'vmdb' to cd into /var/www/miq/vmdb

3) 'grep timeout log/apache/*'.  If this happened in the prior day(s), see [a]

4) You're looking for this line, specifically the port, 3002 in this case:

AH01102: error reading status line from remote server 0.0.0.0:3002.

Note, it could be a different port the next time it occurs so don’t look for 3002.

5) Run 'bin/rake evm:status' and find the MiqUiWorker with port 3002 in its URL:
http://127.0.0.1:3002.  Note the PID of this worker.  You can confirm this by
grepping ps for the puma process with port 3002:  'ps -ef | grep 3002'.  Again, these instructions are for port 3002 but it could be a different port, so make sure you use the port you found in the timeout error in step 4.

6) You'll want to confirm that this process is not handling recent UI requests
by looking for log lines within the last hour or so for that pid, 12345 here:
'grep "#12345" log/production.log' (quote the pattern so the shell doesn't treat '#' as the start of a comment)

7) If this process is running on the port with the timeout errors and it's not
recently been processing requests in the production.log, it's safe to kill it:

'kill 12345'

8) A replacement UiWorker will be started and everything should be back to
normal.

9) Please capture the logs from this appliance (/var/www/miq/vmdb/log,
/var/log), note which pid you had to kill and when this was done.  Please
provide these logs so we can analyze them.

Thank you.

[a] If this was in the previous days, you'll need to grep the prior day's logs
too 'zgrep timeout log/apache/*'
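Steps 3-5 above can be sketched as a small shell helper that pulls the backend port out of an AH01102 timeout line. This is only an illustrative sketch: the 'extract_port' helper name is mine, and the log line format is taken from the sample in step 4.

```shell
#!/bin/sh
# Minimal sketch of steps 3-5 above: pull the backend port out of an
# AH01102 timeout line. The 'extract_port' name is hypothetical; the
# log format matches the sample line in step 4.

extract_port() {
  # Reads apache error lines on stdin; prints the port from the first
  # "error reading status line from remote server 0.0.0.0:PORT" match.
  sed -n 's/.*error reading status line from remote server [0-9.]*:\([0-9]*\).*/\1/p' | head -n 1
}

# Demo on the sample line from step 4:
sample='AH01102: error reading status line from remote server 0.0.0.0:3002.'
echo "$sample" | extract_port
```

On an appliance you would feed it real logs, e.g. 'grep timeout log/apache/* | extract_port', then use the resulting port with 'ps -ef | grep <port>' as in step 5.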

Comment 10 Ryan Spagnola 2017-12-20 13:14:20 UTC
Hi Joe,

Customer has provided the logs after reproducing the error. They have provided the cli command history and logs which are attached to this bz.

Thanks,
Ryan

Comment 11 Joe Rafaniello 2017-12-20 15:41:27 UTC
Hi Ron,

Thanks for the logs.  It looks like the most recent timeout error you found in the ssl_error.log is for the same problematic UI worker, pid 19700,
that we noticed in the last set of logs from comment 5.

The last error was at 5:12 am:
apache/ssl_error.log:[Wed Dec 20 05:12:49.765737 2017] [proxy_http:error] [pid 19918] (70007)The timeout specified has expired: [client X.X.X.X:55694] AH01102: error reading status line from remote server 0.0.0.0:3002, referer: https://REDACTED/

We can see from the top_output.log that the ui worker on port 3002 was pid 19700:

top - 05:12:25 up 43 days, 21:25,  2 users,  load average: 0.28, 0.16, 0.15
Tasks: 265 total,   1 running, 264 sleeping,   0 stopped,   0 zombie
19700 19434 root      21   1  921016 563360   2320 S   0.9  2.3  12:52.16 puma 3.3.0 (tcp://127.0.0.1:3002) [MIQ: Web Server Worker]

Additionally, you'll see that this pid stays around until you kill -9 the process.

At this point, we don't have any UI requests since the process was killed only a few minutes before you collected the logs.
Please test this appliance in the UI and let us know whether the 502 errors return.  If so, please run through the same steps so we can get the diagnostics
and you can work around this problem for now.
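As an aside, the port-to-PID mapping shown above (port 3002 -> pid 19700) can also be pulled out of top/ps output mechanically. A minimal sketch, where the 'pid_for_port' helper name and the field positions are assumptions based on the top line quoted above:

```shell
#!/bin/sh
# Sketch: print the PID of the puma worker bound to a given port, given
# top/ps-style lines on stdin where the PID is the first field (as in the
# top output quoted above). Adjust the field number for other formats.
# 'pid_for_port' is a hypothetical helper, not a CloudForms command.

pid_for_port() {
  awk -v p="tcp://127.0.0.1:$1" '$0 ~ p { print $1; exit }'
}

# Demo on the top line quoted above:
sample='19700 19434 root      21   1  921016 563360   2320 S   0.9  2.3  12:52.16 puma 3.3.0 (tcp://127.0.0.1:3002) [MIQ: Web Server Worker]'
echo "$sample" | pid_for_port 3002
```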

Comment 13 Ron 2017-12-20 15:48:00 UTC
The login no longer times out. There are other timeouts popping up in the logs for different ports; however, the UI is not throwing a 502.

Comment 14 Joe Rafaniello 2017-12-20 16:08:15 UTC
Thanks Ron. I marked the logs as private, just in case. We're looking now.

Comment 15 Joe Rafaniello 2017-12-20 16:45:54 UTC
Ron, we reviewed the new logs and it appears that everything is normal after you kill -9 the problematic UI worker.

We do not see timeouts in the apache logs and the new UI worker, pid 27218, is fielding requests in the production.log and things look normal.

Note, if you are referring to this set of errors in the ssl_error.log, they are expected and should cause no harm:

[Wed Dec 20 10:44:55.745105 2017] [proxy_http:error] [pid 27350] [client X.X.X.X:62699] AH01114: HTTP: failed to make connection to backend: 0.0.0.0, referer: https://REDACTED/ops/explorer
[Wed Dec 20 10:44:55.745217 2017] [proxy:error] [pid 27350] (111)Connection refused: AH00957: HTTP: attempt to connect to 0.0.0.0:3006 (0.0.0.0) failed
[Wed Dec 20 10:44:55.745240 2017] [proxy:error] [pid 27350] AH00959: ap_proxy_connect_backend disabling worker for (0.0.0.0) for 60s

Please continue to use this appliance and monitor for 502 errors.  We're not sure whether the 502s seen previously were an isolated incident or a symptom of something else, so please let us know what you find.

Comment 16 Joe Rafaniello 2018-03-08 19:41:27 UTC
Hi Ron,

Since we've not heard from you, I'm going to close this bug.  If you get 502 errors in the browser and timeouts in your apache logs, please do update this bug or open a new one if it's slightly different.  It's possible the code exercised in the puma worker is hanging, leading to these timeouts.  We'd like to track down and squash any of these bugs.

Also, there was a discussion on our Talk forum relating to these types of errors, with some information I've found helpful in tracking down why a ui/webservice worker might start causing timeout errors:
http://talk.manageiq.org/t/502-proxy-error-after-migrating-from-fine-to-gaprindashvili/3238/18?u=jrafanie

thanks,
Joe

Comment 17 Red Hat Bugzilla 2023-09-15 00:05:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

