Description of problem:

The web service worker becomes unresponsive and the UI starts getting 502 proxy errors. Customer is using 5.11.4.2.

Three nearly simultaneous requests get stuck, causing a fourth request, arriving two seconds later, to get stuck as well:

[----] I, [2020-04-13T17:50:17.314148 #4714:2aac1927d748] INFO -- : Started GET "/api/providers?expand=resources&attributes=name,type,last_compliance_status,aggregate_vm_cpus,aggregate_vm_memory,vms" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.511732 #4714:2aac1927d568] INFO -- : Started GET "/api/availability_zones?expand=resources&attributes=type,id,name" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.512309 #4714:2aac1927d400] INFO -- : Started GET "/api/cloud_volumes?expand=resources&attributes=vms,size" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:19.219942 #4714:2aac1927d324] INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 17:50:19 -0400

Top output shows the web service worker "stuck" until finally going high CPU and completing the requests 30 minutes later:

timesync: date time is-> Mon Apr 13 18:01:01 EDT 2020 -0400
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.42 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.43 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.44 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.44 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.7 1.4 0:07.85 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.86 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.88 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.88 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.90 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.90 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.91 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.92 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.93 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.93 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 0.0 1.4 0:07.94 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 92.2 1.4 1:03.34 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 98.5 1.4 2:02.53 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 98.6 1.4 3:01.77 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 98.6 1.4 4:00.95 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
4714 root 21 1 711976 445896 11984 S 98.6 1.4 5:00.14 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]

[----] I, [2020-04-13T18:20:43.789524 #4714:2aac1927d400] INFO -- : Completed in 1824294ms (ActiveRecord: 384.7ms)
[----] I, [2020-04-13T18:20:43.789692 #4714:2aac1927d568] INFO -- : Completed in 1824289ms (ActiveRecord: 277.3ms)
[----] I, [2020-04-13T18:20:43.789856 #4714:2aac1927d324] INFO -- : Completed in 1824286ms (ActiveRecord: 269.9ms)
[----] I, [2020-04-13T18:20:43.790021 #4714:2aac1927d748] INFO -- : Completed in 1824297ms (ActiveRecord: 397.4ms)

Another set of requests:

[----] I, [2020-04-13T18:21:03.908008 #10755:2aac1c7bc4f4] INFO -- : Started GET "/api/services?expand=resources&filter[]=retired=false&attributes=vms,created_at,name,id" for <REDACTED> at 2020-04-13 18:21:03 -0400
[----] I, [2020-04-13T18:21:05.319782 #10755:2aac1c7bd7c8] INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 18:21:05 -0400
[----] I, [2020-04-13T18:21:05.753903 #10755:2aac1c7bd1c4] INFO -- : Started GET "/api/policy_profiles?expand=resources&expand=resources,policies" for <REDACTED> at 2020-04-13 18:21:05 -0400
[----] I, [2020-04-13T18:21:05.754268 #10755:2aac1c7bc6fc] INFO -- : Started GET "/api/services?expand=resources&filter[]=retired=false&attributes=vms,created_at,name,id" for <REDACTED> at 2020-04-13 18:21:05 -0400

Top output shows the web service worker "stuck" until finally going high CPU and completing the requests 60 minutes later:

timesync: date time is-> Mon Apr 13 19:01:01 EDT 2020 -0400
10755 root 21 1 685004 377628 12044 S 0.0 1.2 0:04.34 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 0.0 1.2 0:04.35 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 0.0 1.2 0:04.36 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 42.2 1.2 0:29.72 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.5 1.2 1:28.91 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 2:28.12 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 3:27.35 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.5 1.2 4:26.52 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 5:25.76 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 6:24.99 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 7:24.14 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 8:23.33 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 9:22.56 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 10:21.79 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 11:21.03 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 12:20.23 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 13:19.41 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 14:18.65 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 15:17.87 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]
10755 root 21 1 685004 377628 12044 S 98.6 1.2 16:17.09 puma 3.7.1 (tcp://127.0.0.1:4000) [MIQ: Web Server Worker]

[----] I, [2020-04-13T19:21:04.301450 #10755:2aac1c7bd7c8] INFO -- : Completed in 3598891ms (ActiveRecord: 58.4ms)
[----] I, [2020-04-13T19:21:04.301659 #10755:2aac1c7bc4f4] INFO -- : Completed in 3599646ms (ActiveRecord: 56.4ms)
[----] I, [2020-04-13T19:21:04.301994 #10755:2aac1c7bd1c4] INFO -- : Completed in 3598191ms (ActiveRecord: 737.2ms)
[----] I, [2020-04-13T19:21:04.302158 #10755:2aac1c7bc6fc] INFO -- : Completed in 3598543ms (ActiveRecord: 168.8ms)
https://github.com/ManageIQ/manageiq/pull/20072
Note: both reported PIDs came out of their lock waits after different amounts of time (30 vs. 60 minutes), and both also logged an "Unexpected error: Too many open files" warning:

[----] W, [2020-04-13T18:20:43.474303 #4714:2aac192aaf68] WARN -- : MIQ(MiqWebServiceWorker#status_update) Unexpected error: Too many open files @ rb_sysopen - /proc/4714/stat, while requesting process info for [Web Services Worker] with PID=[4714]
[----] W, [2020-04-13T19:21:04.041065 #10755:2aac192d7ec8] WARN -- : MIQ(MiqWebServiceWorker#status_update) Unexpected error: Too many open files @ rb_sysopen - /proc/10755/stat, while requesting process info for [Web Services Worker] with PID=[10755]
For background, I'm asking them to run this to get the row counts for the tables associated with the stuck API requests:

vmdb bin/rails r 'puts Provider.count; puts AvailabilityZone.count; puts CloudVolume.count; puts Service.count; puts VmOrTemplate.count; puts MiqPolicySet.count'
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/4012c50dbf32b84618a299102928111c402c76a4

commit 4012c50dbf32b84618a299102928111c402c76a4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 14 14:06:41 2020 +0000
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 14 14:06:41 2020 +0000

    Eager load collection constants to avoid constant loading problems.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1823849

    Allowing threads to autoload code requires us to spend time tracking
    down bugs in code loading and insert disgusting rails interlock blocks
    to "protect" various places that could load constants in threads.

    This solution tries to avoid these workarounds by eager loading all of
    the collection classes and their descendants before any requests are
    made, since we're likely to need to load them anyway.

 app/models/miq_web_service_worker.rb      | 5 +
 spec/models/miq_web_server_worker_spec.rb | 8 +
 2 files changed, 13 insertions(+)
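For context, here is a minimal sketch of the eager-loading idea described in the commit message. The hook name and the hard-coded class list below are illustrative stand-ins only (the class names are the ones from the stuck API requests in this BZ); the authoritative change is the 5-line diff in app/models/miq_web_service_worker.rb from PR #20072.

# Illustrative sketch, not the actual patch: resolve the API collection
# classes (and their STI descendants) once, before puma starts handing
# requests to threads, so no request thread triggers Rails autoloading.
class MiqWebServiceWorker < MiqWorker
  def self.preload_for_worker_role   # hook name assumed for illustration
    # Stand-in for however the API collection models are enumerated.
    %w[Provider AvailabilityZone CloudVolume Service VmOrTemplate MiqPolicySet].each do |name|
      klass = name.safe_constantize
      klass&.descendants             # touching descendants loads the subclasses too
    end
  end
end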
New commit detected on ManageIQ/manageiq/jansa:

https://github.com/ManageIQ/manageiq/commit/91899e25a3943e375e4dc0c5060a1e69e2e836b6

commit 91899e25a3943e375e4dc0c5060a1e69e2e836b6
Author:     Keenan Brock <keenan>
AuthorDate: Tue Apr 14 20:23:50 2020 +0000
Commit:     Keenan Brock <keenan>
CommitDate: Tue Apr 14 20:23:50 2020 +0000

    Merge pull request #20072 from jrafanie/eager-load-api-collection-classes-in-web-service-worker

    Eager load collection constants to avoid constant loading problems.

    (cherry picked from commit 747ab52b29347b7e6352d7b4890307213623c4de)

    https://bugzilla.redhat.com/show_bug.cgi?id=1823849

 app/models/miq_web_service_worker.rb      | 5 +
 spec/models/miq_web_server_worker_spec.rb | 8 +
 2 files changed, 13 insertions(+)
In order to verify this, I need some information about how to reproduce it. @Joe, can you please help?
To test this, you'd have to limit your web service worker count to 1 and restart. Then try to issue the requests in parallel:

[----] I, [2020-04-13T17:50:17.314148 #4714:2aac1927d748] INFO -- : Started GET "/api/providers?expand=resources&attributes=name,type,last_compliance_status,aggregate_vm_cpus,aggregate_vm_memory,vms" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.511732 #4714:2aac1927d568] INFO -- : Started GET "/api/availability_zones?expand=resources&attributes=type,id,name" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:17.512309 #4714:2aac1927d400] INFO -- : Started GET "/api/cloud_volumes?expand=resources&attributes=vms,size" for <REDACTED> at 2020-04-13 17:50:17 -0400
[----] I, [2020-04-13T17:50:19.219942 #4714:2aac1927d324] INFO -- : Started GET "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true" for <REDACTED> at 2020-04-13 17:50:19 -0400

An example from a past BZ is here: https://bugzilla.redhat.com/show_bug.cgi?id=1596703#c29

I will try to see if I can put together a recreation script, but that's basically how you'd recreate it. Each time you run the script and fail to recreate the problem, you'll need to kill the web service worker process: this issue revolves around initial constant (code) loading, so once the code has successfully loaded, you will not be able to recreate it.
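As a starting point, here is a rough sketch of what such a recreation script could look like. The host, port, credentials, and output format are assumptions to adjust for your environment; the four request paths are taken from the logs above.

# Rough reproduction sketch: fire the four API requests from the logs in
# parallel against an appliance limited to a single web service worker.
require "net/http"
require "openssl"

HOST = "localhost"   # assumed appliance host
PORT = 443

PATHS = [
  "/api/providers?expand=resources&attributes=name,type,last_compliance_status,aggregate_vm_cpus,aggregate_vm_memory,vms",
  "/api/availability_zones?expand=resources&attributes=type,id,name",
  "/api/cloud_volumes?expand=resources&attributes=vms,size",
  "/api/vms?expand=resources&attributes=name,vendor,type,last_compliance_status,tags,power_state,v_owning_datacenter,service,cloud,storage,availability_zone_id&filter[]=active=true"
]

threads = PATHS.map do |path|
  Thread.new do
    Net::HTTP.start(HOST, PORT, use_ssl: true, verify_mode: OpenSSL::SSL::VERIFY_NONE) do |http|
      req = Net::HTTP::Get.new(path)
      req.basic_auth("admin", "smartvm")   # assumed default appliance credentials
      started = Time.now
      resp = http.request(req)
      puts format("%-40s -> %s in %.1fs", path[0, 40], resp.code, Time.now - started)
    end
  end
end
threads.each(&:join)

Remember to kill the web service worker (so constants must be autoloaded again) between attempts, as described above.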
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2020
Setting qe_test_coverage to "-": though it would be great to have a test for these kinds of issues, I don't know how to test this effectively and reliably.