Bug 1575663 - Proxy Error - Error reading from remote server
Summary: Proxy Error - Error reading from remote server
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.8.4
Assignee: Joe Rafaniello
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-07 14:51 UTC by Ryan Spagnola
Modified: 2018-06-11 15:57 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-30 14:38:13 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:



Description Ryan Spagnola 2018-05-07 14:51:39 UTC
Description of problem:

Logs are showing the same stack trace as this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1417661

[----] I, [2018-04-30T10:53:11.801914 #1987:1521a9c]  INFO -- : Started POST "/service/tree_select/?id=s-40r5890" for 127.0.0.1 at 2018-04-30 10:53:11 -0400
[----] I, [2018-04-30T10:53:11.804704 #1987:1521a9c]  INFO -- : Processing by ServiceController#tree_select as JS
[----] I, [2018-04-30T10:53:11.804776 #1987:1521a9c]  INFO -- :   Parameters: {"id"=>"s-40r5890"}
[----] W, [2018-04-30T10:53:11.815887 #1987:1521a9c]  WARN -- : Can't verify CSRF token authenticity.
[----] F, [2018-04-30T10:53:11.816315 #1987:1521a9c] FATAL -- : Error caught: [ActionController::InvalidAuthenticityToken] ActionController::InvalidAuthenticityToken
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/request_forgery_protection.rb:195:in `handle_unverified_request'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/request_forgery_protection.rb:223:in `handle_unverified_request'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/request_forgery_protection.rb:218:in `verify_authenticity_token'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:382:in `block in make_lambda'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:150:in `block (2 levels) in halting_and_conditional'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/abstract_controller/callbacks.rb:12:in `block (2 levels) in <module:Callbacks>'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:151:in `block in halting_and_conditional'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:454:in `block in call'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:454:in `each'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:454:in `call'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:101:in `__run_callbacks__'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:750:in `_run_process_action_callbacks'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/callbacks.rb:90:in `run_callbacks'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/abstract_controller/callbacks.rb:19:in `process_action'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/rescue.rb:20:in `process_action'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/notifications.rb:164:in `block in instrument'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/notifications/instrumenter.rb:21:in `instrument'
/opt/rh/cfme-gemset/gems/activesupport-5.0.3/lib/active_support/notifications.rb:164:in `instrument'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/instrumentation.rb:30:in `process_action'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/action_controller/metal/params_wrapper.rb:248:in `process_action'
/opt/rh/cfme-gemset/gems/activerecord-5.0.3/lib/active_record/railties/controller_runtime.rb:18:in `process_action'
/opt/rh/cfme-gemset/gems/actionpack-5.0.3/lib/abstract_controller/base.rb:126:in `process'


Version-Release number of selected component (if applicable):
5.8.2.3

How reproducible:
Random

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
* Approximate date/time of last occurrence:
04/30/18 prior to 8:30am EST

The customer has found that the following command pipeline returns non-empty output on the UI appliances when the problem exists, and returns nothing when the appliance is operating as expected.

# important: if there are values in column2 (Recv-Q), it strongly indicates
# "proxy error" issue and requires restarts
netstat -aWn|egrep ":3000|:4000"|grep EST|awk '$2 !~/^0/ {print}'

At that point, the customer is able to run the following to restore normal operational behavior (i.e. the basic ability to log in and even be presented with a login page):

systemctl stop httpd.service
systemctl restart evmserverd.service
systemctl restart memcached.service
systemctl start httpd.service
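
For reference, here is a minimal sketch that ties the detection pipeline and the restart sequence above together in one script. The script name and messages are illustrative only; the netstat check and systemctl commands are exactly the ones quoted above.

#!/bin/bash
# check_proxy_error.sh (hypothetical name) -- sketch only, not a supported tool.
# Established connections on :3000/:4000 with a non-zero Recv-Q indicate the
# "proxy error" state described in this report.
stuck=$(netstat -aWn | egrep ":3000|:4000" | grep EST | awk '$2 !~ /^0/ {print}')

if [ -n "$stuck" ]; then
    echo "Backlogged UI connections detected; restarting services"
    systemctl stop httpd.service
    systemctl restart evmserverd.service
    systemctl restart memcached.service
    systemctl start httpd.service
fi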

Comment 6 Joe Rafaniello 2018-05-08 20:18:56 UTC
I'm seeing lots of Apache timeouts associated with 502/503 errors in the logs.  They look like this:

(70007)The timeout specified has expired: [client x.x.x.x:55518] AH01102: error reading status line from remote server 0.0.0.0:3000, referer: https://xxx/dashboard/show

These mostly occurred in the ui1 logs although there were 3 occurrences in the ui2 logs.
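
A quick way to tally these per log file (the AH01102 marker is taken from the error line above; the ssl_error_log glob is an assumption and may need adjusting to the actual Apache log layout on the appliances):

# count of "error reading status line" timeouts per Apache error log, skipping zero-count files
zgrep -c 'AH01102' */ssl_error_log* 2>/dev/null | grep -v ':0$'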

Digging deeper...

Let's look for an uneven number of Started/Completed log lines per pid/tid in the Rails logs:

zgrep -E 'Started|Completed' */production.log | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c
     24 #1987:151e02c
     30 #1987:151e158
     26 #1987:151e298
     20 #1987:1521a9c
     20 #1987:1521d6c
      8 #1999:1519068
      8 #1999:151961c
      8 #1999:15197e8
     10 #1999:1519bbc
      8 #1999:1519d10
      4 #2067:1b8c490
      5 #2067:1b8d6b0
      2 #2067:1b8e844
      4 #2067:1b8f140
      5 #2067:1b8f2d0
      5 #2070:1c9c204
      4 #2070:1c9c948
      2 #2070:1c9cbb4
      4 #2070:1c9ce48
      3 #2070:1c9fc4c
      2 #2076:1b7ae34
      2 #2076:1b7c98c
      2 #2076:1b7dc38
      2 #2082:1d0130c
      2 #2082:1d016e0
      2 #2082:1d01eec
      2 #2116:1248b7c
      5 #2116:12492d4
      5 #2116:12497ac
      2 #2116:1249a54
      2 #2116:1249be4
    114 #2125:123c494
    114 #2125:123c944
    114 #2125:123cc64
    116 #2125:12410c0
    116 #2125:12416ec

3 of these pid/tid combinations have an odd number of Started/Completed lines, indicating a request that started but never finished (see the sketch after this list):
2067
2070
2116
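
To pull those out automatically, the same pipeline can be extended with a final awk filter that keeps only the odd counts (a sketch, assuming the same log layout as the command above):

zgrep -E 'Started|Completed' */production.log | grep -Eo "#[0-9]+[:a-f0-9]+" | sort | uniq -c | awk '$1 % 2 == 1 {print}'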

2116:
[----] I, [2018-04-30T07:23:47.742266 #2116:12492d4]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 07:23:47 -0400
[----] I, [2018-04-30T07:23:49.758504 #2116:12497ac]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 07:23:49 -0400

2070:
[----] I, [2018-04-30T07:28:19.644102 #2070:1c9c204]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 07:28:19 -0400
[----] I, [2018-04-30T07:28:21.195937 #2070:1c9fc4c]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 07:28:21 -0400

2067:
[----] I, [2018-04-30T08:05:20.978029 #2067:1b8d6b0]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 08:05:20 -0400
[----] I, [2018-04-30T08:05:22.590811 #2067:1b8f2d0]  INFO -- : Started GET "/service/explorer" for 127.0.0.1 at 2018-04-30 08:05:22 -0400

Comment 7 Joe Rafaniello 2018-05-08 20:24:08 UTC
Ryan, how many services do they have?  Is it possible to get a database dump?  

Since this seems to be happening consistently on /service/explorer, I'm thinking there are some optimizations we need to do, but of course I could be wrong.

Note, this is something we've improved in the past, so perhaps this environment is slightly different and that's why we didn't see the performance problem before:  https://github.com/ManageIQ/manageiq/pull/11171

Comment 11 Joe Rafaniello 2018-05-16 15:23:11 UTC
Ryan, I cannot access this appliance or the postgresql database.  If fix_auth is failing, we should still be able to look at the tables while we try to get past the fix_auth failure.

Comment 24 Joe Rafaniello 2018-05-30 14:38:13 UTC
Closing this for now.  The size of the miq_queue table (777k+ rows) could have been responsible for the slowness we've seen but we'll re-open this case if the problem resurfaces.
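
For anyone hitting this later, a quick way to check the queue size on the appliance (vmdb_production and the root database user are the CFME defaults; adjust if this appliance is configured differently):

# row count of the queue table
psql -U root vmdb_production -c "SELECT count(*) FROM miq_queue;"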

