Description of problem: When you get a 502 error in the UI or API, the only way to track it down and figure out what happened is to look at the apache ssl_error.log to find the exact time of the timeout. From there, you can see which puma port was accessed and timed out. Then, you need to look at the top_output.log or evm.log to figure out which PID had that port at that exact time. From there, you need to look at the production.log for that PID to see what requests were Started but never Completed. Finally, you can look at the request, any parameters passed to it and investigate why that request hung. This is tedious. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. Modify the /api/ping or /ping endpoints to sleep > 120 seconds 2. Access https://host/api/ping or https://host/ping and wait for the browser error. 3. Check the production.log... you'll find a request Started but not Completed. Actual results: We don't log or inform the user as to what request was hung or just taking a long time. Expected results: We should call out any hung or slow requests that could trigger a 502 so we can greatly improve the turnaround time in troubleshooting these problems. Additional info:
In order to test this, the /ping and /api/ping controllers can be found below. You can modify them to do sleep 180 or whatever value is > 120: https://github.com/ManageIQ/manageiq-ui-classic/blob/df30be45d2839ef3b3945efe4d66d55f02942fb3/app/controllers/ping_controller.rb#L4 https://github.com/ManageIQ/manageiq-api/blob/fa6ab6570125984c47f08990b3ce86877dee46b9/app/controllers/api/ping_controller.rb#L4
https://github.com/ManageIQ/manageiq/pull/17842
Testing: This is how I tested some normal requests and some really slow: [root@localhost vmdb]# bundle show manageiq-api /opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-api-be689fe211b9 [root@localhost vmdb]# cd /opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-api-be689fe211b9 [root@localhost manageiq-api-be689fe211b9]# cat app/controllers/api/ping_controller.rb module Api class PingController < ActionController::API def index sleep 180 if rand(5) == 0 render :plain => 'pong' end end end
New commits detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/cf34c1f53478267fa7604536a54ea670061e8e64 commit cf34c1f53478267fa7604536a54ea670061e8e64 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Aug 10 13:44:48 2018 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Aug 10 13:44:48 2018 -0400 Add a middleware to detect long running requests https://bugzilla.redhat.com/show_bug.cgi?id=1614918 This middleware tracks each request and when it started using thread local variables. long_running_requests traverses these thread local variables to find requests beyond the 2 minute timeout. This information can be retrieved by another thread, such as the heartbeating thread of a puma worker, which can then respond to a timed out request. lib/request_started_on_middleware.rb | 57 + spec/lib/request_started_on_middleware_spec.rb | 28 + 2 files changed, 85 insertions(+) https://github.com/ManageIQ/manageiq/commit/61680c19a152abd49bdb93af02ec1e50489bbd21 commit 61680c19a152abd49bdb93af02ec1e50489bbd21 Author: Joe Rafaniello <jrafanie> AuthorDate: Fri Aug 10 13:53:24 2018 -0400 Commit: Joe Rafaniello <jrafanie> CommitDate: Fri Aug 10 13:53:24 2018 -0400 Log long running requests via RequestStartedOnMiddleware https://bugzilla.redhat.com/show_bug.cgi?id=1614918 For puma based workers, we implement do_heartbeat_work to log_long_running_requests. This method is run very frequently per minute, so we ensure this check happens at most every 30 seconds. The log message includes the requested URI path, the thread (in the same PID:TID format that we log elsewhere) and the request duration so far. Using the PID:TID information, users can then find the request parameters by searching the logs for this PID:TID. app/models/mixins/miq_web_server_runner_mixin.rb | 18 + config/application.rb | 2 + 2 files changed, 20 insertions(+)
New commit detected on ManageIQ/manageiq/hammer: https://github.com/ManageIQ/manageiq/commit/6a809cbd3dfbaccc89ff53f384bea797e3166508 commit 6a809cbd3dfbaccc89ff53f384bea797e3166508 Author: Gregg Tanzillo <gtanzill> AuthorDate: Wed Sep 26 17:51:58 2018 -0400 Commit: Gregg Tanzillo <gtanzill> CommitDate: Wed Sep 26 17:51:58 2018 -0400 Merge pull request #17842 from jrafanie/detect_long_running_requests Detect and log long running http(s) requests (cherry picked from commit cc6e263931c3b6bffc60af64168f9ba26207f2a9) https://bugzilla.redhat.com/show_bug.cgi?id=1614918 app/models/mixins/miq_web_server_runner_mixin.rb | 18 + config/application.rb | 2 + lib/request_started_on_middleware.rb | 57 + spec/lib/request_started_on_middleware_spec.rb | 28 + 4 files changed, 105 insertions(+)
Yes, this should have been put in post several weeks ago.
Verified on 5.10.0.30.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0212