Bug 1823849 - 502 proxy error in the api/web service worker
Summary: 502 proxy error in the api/web service worker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.11.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: GA
: 5.11.5
Assignee: Joe Rafaniello
QA Contact: Jaroslav Henner
Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-04-14 15:46 UTC by Joe Rafaniello
Modified: 2020-07-15 18:08 UTC (History)
5 users (show)

Fixed In Version: 5.11.5.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-05 13:43:34 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:
pm-rhel: cfme-5.11.z+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2020 0 None None None 2020-05-05 13:43:42 UTC

Description Joe Rafaniello 2020-04-14 15:46:39 UTC
Description of problem: Web service worker becomes unresponsive, the UI starts getting 502 proxy errors.  

Customer is using 5.11.4.2.

3 nearly synchronous requests get stuck, causing a 4th, coming in 2 seconds later to also get stuck:

[----] I, [2020-04-13T17:50:17.314148 #4714:2aac1927d748]  INFO -- : Started GET "/api/providers?expand=resources&attributes=name,type,last_compliance_status,aggregate_vm_cpus,aggregate_vm_memory,vms" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.511732 #4714:2aac1927d568]  INFO -- : Started GET "/api/availability_zones?expand=resources&attributes=type,id,name" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.512309 #4714:2aac1927d400]  INFO -- : Started GET "/api/cloud_volumes?expand=resources&attributes=vms,size" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:19.219942 #4714:2aac1927d324]  INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 17:50:19 -0400


Top output shows the web service worker "stuck" until finally going high CPU and completing the requests 30 minutes later:

timesync: date time is-> Mon Apr 13 18:01:01 EDT 2020 -0400
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.42 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.43 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.44 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.44 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.7   1.4   0:07.85 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.86 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.88 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.88 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.90 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.90 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.91 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.92 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.93 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.93 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S   0.0   1.4   0:07.94 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S  92.2   1.4   1:03.34 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S  98.5   1.4   2:02.53 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S  98.6   1.4   3:01.77 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S  98.6   1.4   4:00.95 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
 4714 root      21   1  711976 445896  11984 S  98.6   1.4   5:00.14 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]


[----] I, [2020-04-13T18:20:43.789524 #4714:2aac1927d400]  INFO -- : Completed   in 1824294ms (ActiveRecord: 384.7ms)
[----] I, [2020-04-13T18:20:43.789692 #4714:2aac1927d568]  INFO -- : Completed   in 1824289ms (ActiveRecord: 277.3ms)
[----] I, [2020-04-13T18:20:43.789856 #4714:2aac1927d324]  INFO -- : Completed   in 1824286ms (ActiveRecord: 269.9ms)
[----] I, [2020-04-13T18:20:43.790021 #4714:2aac1927d748]  INFO -- : Completed   in 1824297ms (ActiveRecord: 397.4ms)


Another set of requests:

[----] I, [2020-04-13T18:21:03.908008 #10755:2aac1c7bc4f4]  INFO -- : Started GET "/api/services?expand=resources&filter[]=retired=false&attributes=vms,created_at,name,id" for <REDACTED> at 2020-04-13 18:21:03 -0400
[----] I, [2020-04-13T18:21:05.319782 #10755:2aac1c7bd7c8]  INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 18:21:05 -0400
[----] I, [2020-04-13T18:21:05.753903 #10755:2aac1c7bd1c4]  INFO -- : Started GET "/api/policy_profiles?expand=resources&expand=resources,policies" for <REDACTED> at 2020-04-13 18:21:05 -0400
[----] I, [2020-04-13T18:21:05.754268 #10755:2aac1c7bc6fc]  INFO -- : Started GET "/api/services?expand=resources&filter[]=retired=false&attributes=vms,created_at,name,id" for <REDACTED> at 2020-04-13 18:21:05 -0400


Top output shows the web service worker "stuck" until finally going high CPU and completing the requests 60 minutes later:

timesync: date time is-> Mon Apr 13 19:01:01 EDT 2020 -0400
10755 root      21   1  685004 377628  12044 S   0.0   1.2   0:04.34 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S   0.0   1.2   0:04.35 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S   0.0   1.2   0:04.36 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  42.2   1.2   0:29.72 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.5   1.2   1:28.91 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   2:28.12 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   3:27.35 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.5   1.2   4:26.52 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   5:25.76 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   6:24.99 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   7:24.14 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   8:23.33 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2   9:22.56 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  10:21.79 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  11:21.03 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  12:20.23 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  13:19.41 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  14:18.65 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  15:17.87 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root      21   1  685004 377628  12044 S  98.6   1.2  16:17.09 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]


[----] I, [2020-04-13T19:21:04.301450 #10755:2aac1c7bd7c8]  INFO -- : Completed   in 3598891ms (ActiveRecord: 58.4ms)
[----] I, [2020-04-13T19:21:04.301659 #10755:2aac1c7bc4f4]  INFO -- : Completed   in 3599646ms (ActiveRecord: 56.4ms)
[----] I, [2020-04-13T19:21:04.301994 #10755:2aac1c7bd1c4]  INFO -- : Completed   in 3598191ms (ActiveRecord: 737.2ms)
[----] I, [2020-04-13T19:21:04.302158 #10755:2aac1c7bc6fc]  INFO -- : Completed   in 3598543ms (ActiveRecord: 168.8ms)

Comment 3 Joe Rafaniello 2020-04-14 16:07:18 UTC
Note, I'm seeing that both reported PIDs exited at different amounts of time waiting on locks 30 vs. 60 minutes with Unexpected error: Too many open files 

[----] W, [2020-04-13T18:20:43.474303 #4714:2aac192aaf68]  WARN -- : MIQ(MiqWebServiceWorker#status_update) Unexpected error: Too many open files @ rb_sysopen - /proc/4714/stat, while requesting process info for [Web Services Worker] with PID=[4714]


[----] W, [2020-04-13T19:21:04.041065 #10755:2aac192d7ec8]  WARN -- : MIQ(MiqWebServiceWorker#status_update) Unexpected error: Too many open files @ rb_sysopen - /proc/10755/stat, while requesting process info for [Web Services Worker] with PID=[10755]

Comment 4 Joe Rafaniello 2020-04-14 16:14:09 UTC
For background, I'm requesting them run this to get the number of rows for the relevant tables associated with the stuck api requests:

vmdb
bin/rails r 'puts Provider.count; AvailabilityZone.count; puts CloudVolume.count; puts Service.count; puts VmOrTemplate.count; puts MiqPolicySet.count'

Comment 5 CFME Bot 2020-04-14 20:25:20 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/4012c50dbf32b84618a299102928111c402c76a4
commit 4012c50dbf32b84618a299102928111c402c76a4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 14 14:06:41 2020 +0000
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 14 14:06:41 2020 +0000

    Eager load collection constants to avoid constant loading problems.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1823849

    Allowing threads to autoload code requires us to spend time tracking down bugs
    in code loading and insert disgusting rails interlock blocks to "protect"
    various places that could load constants in threads.

    This solution tries to avoid these workarounds by eager loading all of the
    collection classes and their descendants before any requests are made, since
    we're likely to need to load them anyway.

 app/models/miq_web_service_worker.rb | 5 +
 spec/models/miq_web_server_worker_spec.rb | 8 +
 2 files changed, 13 insertions(+)

Comment 6 CFME Bot 2020-04-15 14:15:18 UTC
New commit detected on ManageIQ/manageiq/jansa:

https://github.com/ManageIQ/manageiq/commit/91899e25a3943e375e4dc0c5060a1e69e2e836b6
commit 91899e25a3943e375e4dc0c5060a1e69e2e836b6
Author:     Keenan Brock <keenan>
AuthorDate: Tue Apr 14 20:23:50 2020 +0000
Commit:     Keenan Brock <keenan>
CommitDate: Tue Apr 14 20:23:50 2020 +0000

    Merge pull request #20072 from jrafanie/eager-load-api-collection-classes-in-web-service-worker

    Eager load collection constants to avoid constant loading problems.

    (cherry picked from commit 747ab52b29347b7e6352d7b4890307213623c4de)

    https://bugzilla.redhat.com/show_bug.cgi?id=1823849

 app/models/miq_web_service_worker.rb | 5 +
 spec/models/miq_web_server_worker_spec.rb | 8 +
 2 files changed, 13 insertions(+)

Comment 8 Jaroslav Henner 2020-04-19 15:50:47 UTC
In order to verify this I need some information about how to reproduce. @Joe, can you please help?

Comment 9 Joe Rafaniello 2020-04-20 16:39:42 UTC
To test this, you'd have to limit your web service worker count to 1, restart.

Then try to do the requests in parallel:

[----] I, [2020-04-13T17:50:17.314148 #4714:2aac1927d748]  INFO -- : Started GET "/api/providers?expand=resources&attributes=name,type,last_compliance_status,aggregate_vm_cpus,aggregate_vm_memory,vms" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.511732 #4714:2aac1927d568]  INFO -- : Started GET "/api/availability_zones?expand=resources&attributes=type,id,name" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.512309 #4714:2aac1927d400]  INFO -- : Started GET "/api/cloud_volumes?expand=resources&attributes=vms,size" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:19.219942 #4714:2aac1927d324]  INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 17:50:19 -0400

An example from a past BZ is here:  https://bugzilla.redhat.com/show_bug.cgi?id=1596703#c29

I will try to see if I can create a recreation script but that's basically how'd you recreate it.

Each time you run this script and fail to recreate it, you'll need to kill the web service worker process as this issue revolves around initial constant (code) loading, so once the code successfully loads, you will not be able to recreate it.

Comment 13 errata-xmlrpc 2020-05-05 13:43:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2020

Comment 14 Jaroslav Henner 2020-05-19 11:15:15 UTC
Setting qe_tests_coverate to - as though it would be great to have a test for these kind of issues, I don't know how to test this effectively and reliably.


Note You need to log in before you can comment on or make changes to this bug.