Bug 1614918

Summary: [RFE] Create a tool to track requests to puma so that when users get 502 errors, the logs tell us exactly which request timed out
Product: Red Hat CloudForms Management Engine Reporter: Joe Rafaniello <jrafanie>
Component: ApplianceAssignee: Joe Rafaniello <jrafanie>
Status: CLOSED ERRATA QA Contact: Tasos Papaioannou <tpapaioa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.9.0CC: abellott, jrafanie, obarenbo, simaishi, tpapaioa, yrudman
Target Milestone: GAKeywords: FutureFeature
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.18 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-07 23:03:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joe Rafaniello 2018-08-10 17:40:44 UTC
Description of problem: When you get a 502 error in the UI or API, the only way to track it down and figure out what happened is to look at the apache ssl_error.log to find the exact time of the timeout.  From there, you can see which puma port was accessed and timed out.  Then, you need to look at the top_output.log or evm.log to figure out which PID had that port at that exact time.  From there, you need to look at the production.log for that PID to see what requests were Started but never Completed.  Finally, you can look at the request, any parameters passed to it and investigate why that request hung.

This is tedious.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.  Modify the /api/ping or /ping endpoints to sleep > 120 seconds
2.  Access https://host/api/ping or  https://host/ping and wait for the browser error.
3.  Check the production.log... you'll find a request Started but not Completed.

Actual results:  We don't log or inform the user as to what request was hung or just taking a long time.


Expected results:  We should call out any hung or slow requests that could trigger a 502 so we can greatly improve the turnaround time in troubleshooting these problems.


Additional info:

Comment 2 Joe Rafaniello 2018-08-10 17:43:15 UTC
In order to test this, the /ping and /api/ping controllers can be found below.  You can modify them to do sleep 180 or whatever value is > 120:

https://github.com/ManageIQ/manageiq-ui-classic/blob/df30be45d2839ef3b3945efe4d66d55f02942fb3/app/controllers/ping_controller.rb#L4

https://github.com/ManageIQ/manageiq-api/blob/fa6ab6570125984c47f08990b3ce86877dee46b9/app/controllers/api/ping_controller.rb#L4

Comment 4 Joe Rafaniello 2018-08-10 19:29:56 UTC
Testing:

This is how I tested some normal requests and some really slow:

[root@localhost vmdb]# bundle show manageiq-api
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-api-be689fe211b9

[root@localhost vmdb]# cd /opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-api-be689fe211b9

[root@localhost manageiq-api-be689fe211b9]# cat app/controllers/api/ping_controller.rb
module Api
  class PingController < ActionController::API
    def index
      sleep 180 if rand(5) == 0
      render :plain => 'pong'
    end
  end
end

Comment 5 CFME Bot 2018-09-26 21:56:10 UTC
New commits detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/cf34c1f53478267fa7604536a54ea670061e8e64
commit cf34c1f53478267fa7604536a54ea670061e8e64
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Aug 10 13:44:48 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Aug 10 13:44:48 2018 -0400

    Add a middleware to detect long running requests

    https://bugzilla.redhat.com/show_bug.cgi?id=1614918

    This middleware tracks each request and when it started using thread
    local variables.  long_running_requests traverses these thread local
    variables to find requests beyond the 2 minute timeout.  This
    information can be retrieved by another thread, such as the heartbeating
    thread of a puma worker, which can then respond to a timed out request.

 lib/request_started_on_middleware.rb | 57 +
 spec/lib/request_started_on_middleware_spec.rb | 28 +
 2 files changed, 85 insertions(+)


https://github.com/ManageIQ/manageiq/commit/61680c19a152abd49bdb93af02ec1e50489bbd21
commit 61680c19a152abd49bdb93af02ec1e50489bbd21
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Aug 10 13:53:24 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Aug 10 13:53:24 2018 -0400

    Log long running requests via RequestStartedOnMiddleware

    https://bugzilla.redhat.com/show_bug.cgi?id=1614918

    For puma based workers, we implement do_heartbeat_work to
    log_long_running_requests. This method is run very frequently per minute,
    so we ensure this check happens at most every 30 seconds.

    The log message includes the requested URI path, the thread (in the same
    PID:TID format that we log elsewhere) and the request duration so far.

    Using the PID:TID information, users can then find the request
    parameters by searching the logs for this PID:TID.

 app/models/mixins/miq_web_server_runner_mixin.rb | 18 +
 config/application.rb | 2 +
 2 files changed, 20 insertions(+)

Comment 6 CFME Bot 2018-10-01 16:56:55 UTC
New commit detected on ManageIQ/manageiq/hammer:

https://github.com/ManageIQ/manageiq/commit/6a809cbd3dfbaccc89ff53f384bea797e3166508
commit 6a809cbd3dfbaccc89ff53f384bea797e3166508
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Wed Sep 26 17:51:58 2018 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Wed Sep 26 17:51:58 2018 -0400

    Merge pull request #17842 from jrafanie/detect_long_running_requests

    Detect and log long running http(s) requests

    (cherry picked from commit cc6e263931c3b6bffc60af64168f9ba26207f2a9)

    https://bugzilla.redhat.com/show_bug.cgi?id=1614918

 app/models/mixins/miq_web_server_runner_mixin.rb | 18 +
 config/application.rb | 2 +
 lib/request_started_on_middleware.rb | 57 +
 spec/lib/request_started_on_middleware_spec.rb | 28 +
 4 files changed, 105 insertions(+)

Comment 8 Joe Rafaniello 2018-12-13 20:43:40 UTC
Yes, this should have been put in post several weeks ago.

Comment 9 Tasos Papaioannou 2019-01-09 15:50:46 UTC
Verified on 5.10.0.30.

Comment 10 errata-xmlrpc 2019-02-07 23:03:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212