Bug 1677711 - localhost:4000/api is not responding
Summary: localhost:4000/api is not responding
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: API
Version: 5.9.6
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: GA
Target Release: 5.10.z
Assignee: Joe Rafaniello
QA Contact: Parthvi Vala
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks: 1679295
 
Reported: 2019-02-15 16:26 UTC by Ryan Spagnola
Modified: 2019-11-06 20:00 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-15 21:49:38 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description Ryan Spagnola 2019-02-15 16:26:03 UTC
Description of problem:
httpd and evmserverd are running but /api/ is not responding.


Version-Release number of selected component (if applicable):
5.9.6

How reproducible:
sporadic 

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
This issue is happening sporadically when they POST a request for an automation task.
They call the CloudForms endpoint, not for a specific provisioning request but for any automation request. When load appears to be high, they see these 502 errors, which fail the respective automation tasks.

httpd is running
evmserverd is running

WebServiceWorker is running
puma :4000 is running

curl -k https://localhost/api is not responsive
curl http://localhost:4000/api is not responsive

systemctl restart httpd did not help
systemctl restart evmserverd did not help
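
For reference, a bounded version of these checks makes it easier to tell a hung listener from a merely slow one (ports and endpoints are the ones described above; the curl flags are standard):

# Sketch only: --max-time caps the wait, -w reports the status code and total time
time curl -k -s -o /dev/null --max-time 60 -w 'HTTP %{http_code} in %{time_total}s\n' https://localhost/api
time curl -s -o /dev/null --max-time 60 -w 'HTTP %{http_code} in %{time_total}s\n' http://localhost:4000/api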

Comment 3 Gregg Tanzillo 2019-02-15 18:24:36 UTC
I was able to see a bunch of 502 errors in the latest set of apache logs. However, those apache logs are archived and we don't have the corresponding evm and production logs to match up with those errors.

However, I was able to see a handful of 502 errors in the logs from 2/7 with matching production log entries.

These are all of the 502 errors there:

log/apache/ssl_access.log:10.0.65.101 - - [07/Feb/2019:12:21:30 -0700] "GET /api/automation_requests/32000000104932 HTTP/1.1" 200 5021
log/apache/ssl_access.log:10.0.65.99 - - [07/Feb/2019:13:40:01 -0700] "POST /api/provision_requests HTTP/1.1" 502 424
log/apache/ssl_access.log:10.0.65.101 - - [07/Feb/2019:13:40:04 -0700] "POST /api/provision_requests HTTP/1.1" 502 424
log/apache/ssl_access.log:10.0.65.98 - - [07/Feb/2019:13:40:13 -0700] "POST /api/provision_requests HTTP/1.1" 502 424
log/apache/ssl_access.log-20190205.gz:10.0.64.43 - - [04/Feb/2019:21:13:33 -0700] "POST /api/provision_requests HTTP/1.1" 502 424

It looks like they correspond to a short period of time when there were some long-running requests varying from 30s to 60s:

[----] I, [2019-02-07T13:38:03.598685 #12381:19fff14]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:03 -0700
[----] I, [2019-02-07T13:38:09.106008 #12381:1a002fc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:09 -0700
[----] I, [2019-02-07T13:38:11.571006 #12381:19f86ec]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:11 -0700
[----] I, [2019-02-07T13:38:18.958363 #12381:19f89bc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:18 -0700
[----] I, [2019-02-07T13:38:23.386576 #12381:19fee84]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:23 -0700
[----] I, [2019-02-07T13:38:32.439876 #12381:19fff14]  INFO -- : Completed 200 OK in 28835ms (Views: 0.3ms | ActiveRecord: 4318.0ms)
[----] I, [2019-02-07T13:38:32.796553 #12381:19fff14]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:32 -0700
[----] I, [2019-02-07T13:38:55.655375 #12381:1a002fc]  INFO -- : Completed 200 OK in 46540ms (Views: 0.2ms | ActiveRecord: 16450.7ms)
[----] I, [2019-02-07T13:38:55.893546 #12381:1a002fc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:38:55 -0700
[----] I, [2019-02-07T13:39:01.175647 #12381:19f86ec]  INFO -- : Completed 200 OK in 49592ms (Views: 0.2ms | ActiveRecord: 21074.0ms)
[----] I, [2019-02-07T13:39:01.843271 #12381:19f86ec]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:39:01 -0700
[----] I, [2019-02-07T13:39:08.089279 #12381:19f89bc]  INFO -- : Completed 200 OK in 49115ms (Views: 0.3ms | ActiveRecord: 22719.4ms)
[----] I, [2019-02-07T13:39:08.303158 #12381:19f89bc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:39:08 -0700
[----] I, [2019-02-07T13:39:16.361245 #12381:19fee84]  INFO -- : Completed 200 OK in 52937ms (Views: 0.3ms | ActiveRecord: 28616.5ms)
[----] I, [2019-02-07T13:39:16.585154 #12381:19fee84]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:39:16 -0700
[----] I, [2019-02-07T13:39:35.161532 #12381:19fff14]  INFO -- : Completed 200 OK in 62338ms (Views: 0.2ms | ActiveRecord: 27921.3ms)
[----] I, [2019-02-07T13:39:35.520582 #12381:19fff14]  INFO -- : Started GET "/api/provision_requests/32000000104991" for 127.0.0.1 at 2019-02-07 13:39:35 -0700
[----] I, [2019-02-07T13:39:36.597071 #12381:19fff14]  INFO -- : Completed 200 OK in 1065ms (Views: 0.2ms | ActiveRecord: 802.9ms)
[----] I, [2019-02-07T13:39:38.335947 #12381:19fff14]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:39:38 -0700
[----] I, [2019-02-07T13:39:56.785337 #12381:1a002fc]  INFO -- : Completed 200 OK in 60882ms (Views: 0.2ms | ActiveRecord: 28546.4ms)
[----] I, [2019-02-07T13:39:57.019357 #12381:1a002fc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:39:57 -0700
[----] I, [2019-02-07T13:40:01.285911 #12381:19f86ec]  INFO -- : Completed 200 OK in 59430ms (Views: 0.3ms | ActiveRecord: 29126.7ms)
[----] I, [2019-02-07T13:40:01.497065 #12381:19f86ec]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:40:01 -0700
[----] I, [2019-02-07T13:40:04.932435 #12381:19fee84]  INFO -- : Completed 200 OK in 48341ms (Views: 0.2ms | ActiveRecord: 26791.9ms)
[----] I, [2019-02-07T13:40:05.141158 #12381:19fee84]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:40:05 -0700
[----] I, [2019-02-07T13:40:07.321970 #12381:19f89bc]  INFO -- : Completed 200 OK in 59011ms (Views: 0.2ms | ActiveRecord: 30891.2ms)
[----] I, [2019-02-07T13:40:07.526851 #12381:19f89bc]  INFO -- : Started POST "/api/provision_requests" for 127.0.0.1 at 2019-02-07 13:40:07 -0700
[----] I, [2019-02-07T13:40:29.978660 #12381:19fff14]  INFO -- : Completed 200 OK in 51614ms (Views: 0.3ms | ActiveRecord: 20595.0ms)

Given that there is only 1 API worker configured, I would recommend adding at least one more to get it over the hump during these periods where many requests arrive at a high frequency along with some long-running ones.
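
For reference, a quick way to see how many web service workers are configured and running on the appliance (a sketch only; the vmdb helper and the :web_service_worker: settings key are assumptions about the standard appliance layout, and the count itself is normally changed from the server's Workers settings in the UI):

vmdb                                                   # cd /var/www/miq/vmdb on a CFME appliance
bin/rake evm:status                                    # lists running workers, including MiqWebServiceWorker
grep -A 3 ':web_service_worker:' config/settings.yml   # default worker count shipped with this build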

Comment 9 Joe Rafaniello 2019-02-20 16:47:55 UTC
It looks like there are no hung requests, just slow requests.

On the same appliance:
1 UI worker
2 web service workers (5 threads per process)
PostgreSQL database is running
OpenStack event catcher and refresh worker
Event Handler

With this configuration, if the 10 web service worker threads are all busy, you'll get 502 errors since Apache can't find an available thread to handle the request.
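
If needed, the Apache side of those 502s can be confirmed in the SSL error log (a sketch; the path is assumed to follow the ssl_access.log location quoted in comment 3):

grep -i 'proxy' log/apache/ssl_error.log | tail -20    # proxy errors logged when no backend thread answered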

[----] I, [2019-02-19T21:41:20.325822 #62072:1045118]  INFO -- : MIQ(MiqWebServiceWorker.sync_workers) Workers are being synchronized: Current #: [1], Desired #: [2]
[----] I, [2019-02-19T21:44:40.888018 #10768:fe5114]  INFO -- : MIQ(MiqUiWorker.sync_workers) Workers are being synchronized: Current #: [0], Desired #: [1]

I'm seeing slow requests for authentication, anywhere from 13 to 30 seconds, so I'll be looking at why authentication takes so long to complete.  If you have more than a handful of users trying to log in at the same time, you could use up these 10 threads on these 13-30 second requests.  I'm still looking, but here's what I see so far:

[----] I, [2019-02-19T21:40:25.940144 #62436:1ad0614]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:40:25 -0700
[----] I, [2019-02-19T21:40:55.869661 #62436:1ad0614]  INFO -- : Completed 200 OK in 29924ms (Views: 0.2ms | ActiveRecord: 78.1ms)

[----] I, [2019-02-19T21:41:35.843530 #62436:1ad3260]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:41:35 -0700
[----] I, [2019-02-19T21:41:59.004818 #62436:1ad3260]  INFO -- : Completed 200 OK in 23156ms (Views: 0.7ms | ActiveRecord: 34.9ms)

[----] I, [2019-02-19T21:42:01.895664 #9666:1b90004]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:42:01 -0700
[----] I, [2019-02-19T21:42:36.384779 #9666:1b90004]  INFO -- : Completed 200 OK in 34480ms (Views: 0.4ms | ActiveRecord: 89.1ms)

[----] I, [2019-02-19T21:45:56.754893 #11225:1b74afc]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:45:56 -0700
[----] I, [2019-02-19T21:46:25.078124 #11225:1b74afc]  INFO -- : Completed 200 OK in 27021ms (Views: 0.5ms | ActiveRecord: 103.7ms)

[----] I, [2019-02-19T21:46:49.233879 #11225:1b752e0]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:46:49 -0700
[----] I, [2019-02-19T21:47:11.503967 #11225:1b752e0]  INFO -- : Completed 200 OK in 22263ms (Views: 0.2ms | ActiveRecord: 16.4ms)

[----] I, [2019-02-19T21:51:22.273726 #15553:426bc8c]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:51:22 -0700
[----] I, [2019-02-19T21:51:35.369972 #15553:426bc8c]  INFO -- : Completed 200 OK in 11822ms (Views: 0.4ms | ActiveRecord: 310.9ms)

[----] I, [2019-02-19T21:53:32.312809 #15553:42620b0]  INFO -- : Started GET "/api/auth?requester_type=ui" for 127.0.0.1 at 2019-02-19 21:53:32 -0700
[----] I, [2019-02-19T21:53:45.463371 #15553:42620b0]  INFO -- : Completed 200 OK in 13007ms (Views: 0.3ms | ActiveRecord: 17.7ms)

Increasing the web service worker count would give them more threads to handle these web service requests, but we still have to discover why it's taking 30 seconds to perform authentication or authorization.
Of course, running a database and provider workers on the same machine as the UI and web service workers will cause each user to experience slower performance, since they're all competing for resources.
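
A quick way to pull other slow requests out of the same log, based on the "Completed" lines above (the 5-digit millisecond threshold, i.e. 10 seconds or more, is arbitrary):

grep -E 'Completed .* in [0-9]{5,}ms' log/production.log    # requests that took >= 10s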

Comment 11 Joe Rafaniello 2019-02-20 17:07:24 UTC
Ryan, I was thinking of this guide:  http://manageiq.org/blog/2018/01/troubleshooting-auth/

We had another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1531626, where this happened when a user's group membership included a group with Unicode/UTF-8 characters.  That bug was fixed, but perhaps there is something specific about a group that's causing a slightly different problem.  Do all users have this problem?  Can they isolate a single LDAP group membership that's causing it?

Comment 13 Joe Rafaniello 2019-02-20 21:02:56 UTC
Specifically, this is the log line warning that seems to take 20-30 seconds to appear and could be due to the user's group membership.  If we can narrow down what the user's groups are, perhaps we can isolate it to a single group.  Is it a new group?  Are other users in different groups having this problem?

Sanitized log snippet:

[----] I, [2019-02-19T21:46:49.377276 #11225:1b752e0]  INFO -- : MIQ(MiqLdap#bind) Binding to LDAP: Host: [XXX], User: [CN=XXX]... successful
[----] I, [2019-02-19T21:46:49.377556 #11225:1b752e0]  INFO -- : MIQ(MiqLdap#get_user_object) Type: [samaccountname], Base DN: [DC=XXX], Filter: <(samaccountname=XXX)>
[----] W, [2019-02-19T21:47:11.446832 #11225:1b752e0]  WARN -- : MIQ(Authenticator::Ldap#groups_for) Required key not specified: [basedn] (from Authenticator#user_proxy_membership)
[----] I, [2019-02-19T21:47:11.470873 #11225:1b752e0]  INFO -- : MIQ(Authenticator::Ldap#authorize) Authorized User: [XXX]
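
One way to check whether a base DN is configured for group lookup at all, which is what that warning is complaining about (a sketch; the Settings.authentication key path is an assumption, and the output may include bind credentials, so treat it as sensitive):

vmdb
bin/rails runner 'puts Settings.authentication.to_hash.inspect'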

Comment 15 Joe Rafaniello 2019-02-21 17:10:38 UTC
Ryan,

The customer can also try manually running ldapsearch as described in this troubleshooting guide: http://manageiq.org/blog/2018/01/troubleshooting-auth/

Perhaps if they run this in different environments, they can compare timings to see if it's a network issue or consistently slow everywhere.  Also, it would be helpful to know if slow logins have been a problem for a long time or recently became a problem.

We should also consider upgrading off of MiqLdap since it's deprecated.  Using external auth should be faster but we can discuss this after some diagnostics are run.

Thanks,
joe

Comment 16 Joe Vlcek 2019-02-25 21:34:42 UTC
(In reply to Joe Rafaniello from comment #15)
> Ryan,
> 
> The customer can also try manually running ldapsearch as described in this
> troubleshooting guide: http://manageiq.org/blog/2018/01/troubleshooting-auth/
> 
> Perhaps if they run this in different environments, they can compare timings
> to see if it's a network issue or consistently slow everywhere.  Also, it
> would be helpful to know if slow logins have been a problem for a long time
> or recently became a problem.
> 
> We should also consider upgrading off of MiqLdap since it's deprecated. 
> Using external auth should be faster but we can discuss this after some
> diagnostics are run.
> 
> Thanks,
> joe

It would be valuable to know how long the ldapsearch takes to complete. Please ask the customer to run "time ldapsearch <args>" and report the elapsed time.
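
For example (every value below is a placeholder; the real host, bind DN, base DN, and filter should be taken from the appliance's LDAP settings):

time ldapsearch -x -H ldaps://ldap.example.com:636 -W \
    -D 'CN=binduser,OU=svc,DC=example,DC=com' \
    -b 'DC=example,DC=com' '(samaccountname=someuser)'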

Also, in the initial description of this BZ, you reported that using curl to access
the API endpoint is failing. Can you please ask the customer to try running the following
from the appliance shell and let us know whether it succeeds and how long it takes?

e.g.:
my pc> ssh root@<appliance IPAddr>
cfme appliance shell> time curl --user admin:<your admin password> localhost:3000/api

Comment 17 Joe Rafaniello 2019-02-25 23:03:08 UTC
> We should also consider upgrading off of MiqLdap since it's deprecated.  Using external auth should be faster but we can discuss this after some diagnostics are run.

Note, I said this prematurely.  We really need to identify why the group lookup requests are so slow by testing the environment and gathering more information.  We can consider converting the LDAP configuration after things are running more smoothly, if need be.

Comment 20 Joe Rafaniello 2019-03-13 19:57:04 UTC
Ryan, do we have an update on this issue?

Comment 21 dmetzger 2019-03-15 19:51:11 UTC
Closing per conversation with CEE:

"Remove this from 5.10.2 since customer is unable to reproduce the issue and hasnt responded back to the support ticket in a couple weeks"

Comment 22 Joe Rafaniello 2019-03-15 21:49:38 UTC
Closing as per comment #21... looks like it was accidentally put in NEW

