Bug 1439852 - OCP Ad-Hoc metrics fails with "504 Gateway Time-out The server didn't respond in time"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: GA
Target Release: 5.9.0
Assignee: Yaacov Zamir
QA Contact: Dave Johnson
URL:
Whiteboard: container
Depends On: 1439910 1440548
Blocks: 1442167
Reported: 2017-04-06 17:00 UTC by Ian Tewksbury
Modified: 2018-03-06 15:57 UTC (History)

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1442167 (view as bug list)
Environment:
Last Closed: 2018-03-06 15:57:06 UTC
Category: ---
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:


Attachments
error_screen_shot (144.16 KB, image/png)
2017-04-06 17:01 UTC, Ian Tewksbury
live streaming metrics working in OCP (46.57 KB, image/png)
2017-04-09 16:09 UTC, Ian Tewksbury
live metrics working in OCP (273.26 KB, image/png)
2017-04-09 16:20 UTC, Ian Tewksbury
OCP provider validation (167.99 KB, image/png)
2017-04-09 17:19 UTC, Ian Tewksbury
OCP metrics validation (144.24 KB, image/png)
2017-04-09 17:19 UTC, Ian Tewksbury
CFME OCP Pod C&U Timeline working (269.84 KB, image/png)
2017-04-09 17:21 UTC, Ian Tewksbury
CFME OCP provider overview with aggregated metrics (342.26 KB, image/png)
2017-04-09 17:23 UTC, Ian Tewksbury
CFME OCP Pod C&U Timeline working (269.84 KB, image/png)
2017-04-09 17:24 UTC, Ian Tewksbury

Description Ian Tewksbury 2017-04-06 17:00:57 UTC
Description of problem:
OCP metrics work on the overview page, but trying to use ad-hoc metrics produces an error.


Version-Release number of selected component (if applicable):
5.8.0.9-alpha2.20170404195944_1d7ece4


How reproducible:
Another person got a 503 error against the same OCP cluster.


Steps to Reproduce:
1. Add an OCP provider with C&U
2. Verify in overview page that the metrics have populated
3. Compute -> Containers -> Provider -> Monitoring -> Ad hoc Metrics


Actual results:
The metrics never load and a "504 Gateway Time-out The server didn't respond in time" banner message pops up.


Expected results:
No errors. Metrics populate.


Additional info:
5.8 test-a-thon

Comment 2 Ian Tewksbury 2017-04-06 17:01:23 UTC
Created attachment 1269438 [details]
error_screen_shot

Comment 3 Mooli Tayer 2017-04-09 11:27:13 UTC
Not sure there is really a problem here.

How long does it take hawkular to respond when getting the 504 error?

Can you attach the timing of:
time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics

(maybe also a word count of the number of result rows[1])

HAWKULAR_HOST, HAWKULAR_TOKEN you can take from the provider definition.

[1] curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics| wc
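A minimal sketch of the timing check requested above, assuming HAWKULAR_HOST and HAWKULAR_TOKEN are exported as in the commands in this comment. The helper name probe_hawkular and the 60-second default limit are illustrative assumptions, not part of CFME or Hawkular. It prints the HTTP status and the total request time on one line, so a 504 and the delay before it are visible together:

```shell
# Hypothetical helper: print "<http status> <seconds>" for a Hawkular path.
# Assumes HAWKULAR_HOST and HAWKULAR_TOKEN are set in the environment.
probe_hawkular() {
    # $1: path under /hawkular, for example "metrics/metrics"
    # $2: optional client-side limit in seconds (assumed default: 60)
    curl -k -s -o /dev/null \
        --max-time "${2:-60}" \
        -w '%{http_code} %{time_total}\n' \
        -H "Hawkular-Tenant: _system" \
        -H "Authorization: Bearer $HAWKULAR_TOKEN" \
        "https://$HAWKULAR_HOST/hawkular/$1"
}

# Usage: probe_hawkular metrics/metrics
```

curl reports a status of 000 when no HTTP response arrived before the --max-time limit, which distinguishes a client-side timeout from a 504 returned by something in front of Hawkular.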

Comment 4 Mooli Tayer 2017-04-09 11:28:43 UTC
Yaacov do we have a way to set the timeout on this specific flow?

Comment 5 Ian Tewksbury 2017-04-09 14:02:47 UTC
@Mooli,

The only thing I changed is that you had "metrics/metrics" and I changed that to "metrics" since that is the URL of the metrics.

[root@itewk-cfme58-test-1 ~]# time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics
{"Implementation-Version":"0.21.5.Final-redhat-1","Built-From-Git-SHA1":"632f908a52d3e45b3a0bafa84e117ec6ca87bb19","name":"Hawkular-Metrics"}
real	0m0.146s
user	0m0.051s
sys	0m0.060s


[root@itewk-cfme58-test-1 ~]# curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics| wc
      0       1     141
[root@itewk-cfme58-test-1 ~]#

Comment 6 Mooli Tayer 2017-04-09 14:16:43 UTC
The /metrics endpoint only returns a status line. The metrics/metrics endpoint is one of the requests that the page makes. Also, please make sure you get the failure you described while timing. Thanks!
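To compare the two endpoints side by side, something like the following could be run. It is only a sketch: check_endpoints is a hypothetical helper name, the 60-second limit is arbitrary, and HAWKULAR_HOST and HAWKULAR_TOKEN are assumed to be exported.

```shell
# Hypothetical helper: print the HTTP status of the status endpoint
# (/hawkular/metrics) and of the metric list endpoint (/hawkular/metrics/metrics).
check_endpoints() {
    for path in metrics metrics/metrics; do
        code=$(curl -k -s -o /dev/null --max-time 60 \
            -w '%{http_code}' \
            -H "Hawkular-Tenant: _system" \
            -H "Authorization: Bearer $HAWKULAR_TOKEN" \
            "https://$HAWKULAR_HOST/hawkular/$path")
        printf '%s -> HTTP %s\n' "/hawkular/$path" "$code"
    done
}

# Usage: check_endpoints
```

A healthy status endpoint together with a failing list endpoint would match the situation described in this comment.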

Comment 7 Mooli Tayer 2017-04-09 14:17:18 UTC
see comment 4

Comment 8 Yaacov Zamir 2017-04-09 14:38:32 UTC
hi all

@Ian

a. The ad-hoc page does not require C&U or OpenShift, only a Hawkular server.
b. Checking metrics in overview page does show that we have/had a Hawkular server working at some point, BUT we can have data in the ad-hoc metrics without C&U metrics (if we did not set C&U), and we can have metrics in ManageIQ without having metrics in the Ad-Hoc page (if the hawkular server stopped for some reason).
c. a better check would be to confirm that the hawkular server validates correctly in the edit container provider page.

@Mooli

a. yes, sometimes the /metrics endpoint works while the /metrics/metrics endpoint fails, so it is important to check the /metrics/metrics endpoint.

b. yes, we have a way to set timeout:
  b.1. the HAWKULARCLIENT_REST_TIMEOUT environment variable can be set
  b.2. we can decide on a better default value and set it in the ruby code, but we need to know what a reasonable default value is.
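
As a rough illustration of option b.1, the variable can be set for a single process. The sketch below assumes the value is in seconds and is read when the client process starts; neither detail is confirmed in this thread, and with_hawkular_timeout is a hypothetical helper name.

```shell
# Hypothetical helper: run a command with HAWKULARCLIENT_REST_TIMEOUT set.
# Assumption: the client interprets the value as seconds.
with_hawkular_timeout() {
    # $1: timeout value; remaining arguments: the command to run
    timeout_value=$1
    shift
    env HAWKULARCLIENT_REST_TIMEOUT="$timeout_value" "$@"
}

# Usage: with_hawkular_timeout 60 <command that starts the Hawkular client>
```

Setting the variable per command keeps the change local while a reasonable default is still being worked out.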

----------

@Ian,

1. Please check that the Hawkular server still validates in the edit container provider page (we may have an unrelated bug).
2. Please check the /metrics/metrics endpoint with time so we can get an estimate for a reasonable default timeout value.

p.s.
I'm on PTO until Wednesday ...

Comment 9 Yaacov Zamir 2017-04-09 14:45:38 UTC
p.s

on my machine, it takes 3s for hawkular to answer, and there is no error in the page:

[yzamir@localhost ~]$ time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics
[{"id":"machine/yzamir-centos7-2.eng.lab.tlv.redhat.com/network/rx","tags":{"descriptor_name":"network/rx","group_id":"/network/rx","host_id":"yzamir-centos7-2.eng.lab.tlv.redhat.com","hostname":"yzamir-centos7-2.eng.l ...
...
...
group_id":"docker-daemon/memory/major_page_faults_rate","host_id":"yzamir-centos7-2.eng.lab.tlv.redhat.com","hostname":"yzamir-centos7-2.eng.lab.tlv.redhat.com","nodename":"yzamir-centos7-2.eng.lab.tlv.redhat.com","type":"sys_container"},"dataRetention":7,"type":"gauge","tenantId":"_system","minTimestamp":1491134400000,"maxTimestamp":1491748950000}]
real	0m3.045s
user	0m0.031s
sys	0m0.080s
[yzamir@localhost ~]$

Comment 10 Ian Tewksbury 2017-04-09 15:35:32 UTC
Nothing responds at metrics/metrics, which is why I tried metrics:


[root@itewk-cfme58-test-1 ~]# time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

real	0m30.185s
user	0m0.044s
sys	0m0.082s

But the metrics work fine within OCP and the aggregated metrics work within CFME, so why wouldn't the ad-hoc work? Though I don't really understand how the ad-hoc is supposed to work.

Comment 11 Yaacov Zamir 2017-04-09 15:46:42 UTC
>  But the metrics work fine within OCP and the aggregated metrics work within CFME

@Ian, no, the C&U cannot work if /metrics/metrics times out. You are probably seeing metrics collected an hour or a day ago; look for errors in the logs.

> so why wouldn't the ad-hoc work? Though I don't really understand how the ad-hoc is supposed to work.

Ad hoc metrics does not use the metrics collected by the C&U system. It needs a working hawkular server, and yours is timing out (30s!).

> [root@itewk-cfme58-test-1 ~]# time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics
> <html><body><h1>504 Gateway Time-out</h1>

This is not a bug in the ad-hoc page, it's a bug in hawkular or the hawkular setup.

Comment 12 Ian Tewksbury 2017-04-09 16:08:59 UTC
@Yaacov,

I respectfully disagree.

OCP is reporting live streaming metrics without issue and it depends on the same hawkular instance. Why would OCP be able to display the metrics if CFME can't? That seems like an issue with CFME to me, not hawkular.

Attaching screen shot of metrics working in OCP.

Comment 13 Ian Tewksbury 2017-04-09 16:09:48 UTC
Created attachment 1270267 [details]
live streaming metrics working in OCP

Comment 14 Ian Tewksbury 2017-04-09 16:20:00 UTC
Created attachment 1270268 [details]
live metrics working in OCP

Comment 15 Yaacov Zamir 2017-04-09 16:31:34 UTC
@Ian, we cannot research this bug while Hawkular does not behave as expected. If hawkular times out with a "504 Gateway Time-out" error, the best we can do is show this message to the user.

Please fix the hawkular server to behave as expected [1]:
/metrics/metrics should return a list of metrics

Once hawkular is working according to spec, we can continue with this bug.


[1] http://www.hawkular.org/docs/rest/rest-metrics.html#_metric

Comment 16 Yaacov Zamir 2017-04-09 16:45:03 UTC
> OCP is reporting live streaming metrics without issue and it depends on the same hawkular instance. Why would OCP be able to display the metrics if CFME can't? That seems like an issue with CFME to me, not hawkular.

The OpenShift UI runs inside OpenShift, so it may use a different IP, routing, or proxy than ManageIQ.
The OpenShift UI may also have different credentials than ManageIQ.

1. the same curl from within the openshift master may work; this will tell us that the routing is bad.
2. the same curl using username/password may work; this will tell us that the token we are using does not have the right credentials to read metrics.
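
A sketch of check 2, under stated assumptions: HAWKULAR_USER and HAWKULAR_PASSWORD are hypothetical variable names, compare_credentials is a hypothetical helper, and whether this Hawkular deployment accepts basic authentication at all is not confirmed in this thread. Running the same function from the openshift master would cover check 1.

```shell
# Hypothetical helper: request the metric list once with the bearer token
# and once with username/password, then print both HTTP statuses.
compare_credentials() {
    url="https://$HAWKULAR_HOST/hawkular/metrics/metrics"
    token_code=$(curl -k -s -o /dev/null --max-time 60 \
        -w '%{http_code}' \
        -H "Hawkular-Tenant: _system" \
        -H "Authorization: Bearer $HAWKULAR_TOKEN" \
        "$url")
    basic_code=$(curl -k -s -o /dev/null --max-time 60 \
        -w '%{http_code}' \
        -H "Hawkular-Tenant: _system" \
        -u "$HAWKULAR_USER:$HAWKULAR_PASSWORD" \
        "$url")
    printf 'token: HTTP %s\nbasic auth: HTTP %s\n' "$token_code" "$basic_code"
}

# Usage: compare_credentials
```

If both requests return the same 504, credentials are unlikely to be the cause and the route or the Hawkular service itself is the more probable suspect.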

-------------

a. did you check that the hawkular server re-validates correctly?
b. did you check for errors in the C&U logs ( it uses the same route and credentials, so it should fail too )

Comment 18 Ian Tewksbury 2017-04-09 16:47:00 UTC
(In reply to Yaacov Zamir from comment #16)
> > OCP is reporting live streaming metrics without issue and it depends on the same hawkular instance. Why would OCP be able to display the metrics if CFME can't? That seems like an issue with CFME to me, not hawkular.
> 
> The Openshift UI runs inside the Openshift, it may use different ip, or
> routing - proxy from ManageIQ.
> The Openshift UI may have different credentials than ManageIQ.
> 
> 1.  the same curl from within the openshift master may work - this will tell
> us that the routing is bad.
> 2. the same curl using username/password may work, this will tell us that
> the token we are using does not have the right credentials to read metrics.
> 
> -------------
> 
> a. did you check that the hawkular server re-validates correctly ?
> b. did you check for errors in the C&U logs ( it uses the same route and
> credentials, so it should fail too )

As stated in my original bug report, metrics collection in CFME is working fine in the standard C&U UI, and I get metrics for the Pods there. It is only in the ad-hoc metrics page that the metrics do not come through. Therefore the metrics credentials must verify without issue; otherwise we wouldn't be getting the pod-level metrics in the C&U UI.

Comment 21 Ian Tewksbury 2017-04-09 17:19:05 UTC
Created attachment 1270271 [details]
OCP provider validation

Comment 22 Ian Tewksbury 2017-04-09 17:19:31 UTC
Created attachment 1270272 [details]
OCP metrics validation

Comment 23 Ian Tewksbury 2017-04-09 17:21:19 UTC
Created attachment 1270273 [details]
CFME OCP Pod C&U Timeline working

Comment 24 Ian Tewksbury 2017-04-09 17:21:57 UTC
Is there a way to move this over to the OCP team?

Comment 25 Ian Tewksbury 2017-04-09 17:23:53 UTC
Created attachment 1270274 [details]
CFME OCP provider overview with aggregated metrics

Comment 26 Ian Tewksbury 2017-04-09 17:24:26 UTC
Created attachment 1270275 [details]
CFME OCP Pod C&U Timeline working

Comment 29 Yaacov Zamir 2017-04-09 18:07:46 UTC
> Is there a way to move this over to the OCP team?

We can open a new bug on OpenShift about:

[root@itewk-cfme58-test-1 ~]# time curl -k -s     -H "Hawkular-Tenant: _system"     -H "Authorization: Bearer $HAWKULAR_TOKEN"     https://$HAWKULAR_HOST/hawkular/metrics/metrics
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

real	0m30.185s
user	0m0.044s
sys	0m0.082s

If the URL and credentials are correct, OpenShift with Hawkular working correctly should return a list of metrics, without an error.

Comment 31 Mooli Tayer 2017-04-09 21:31:07 UTC
(In reply to Ian Tewksbury from comment #24)
> Is there a way to move this over to the OCP team?

Unfortunately no, flags and defaults get messed up.

I created https://bugzilla.redhat.com/show_bug.cgi?id=1440548. Ian, please help the hawkular team figure it out; we need to get to the bottom of this.

Comment 32 Ian Tewksbury 2017-04-09 21:48:15 UTC
@Mooli,

Will do. I will track down with OCP hawkular folks and see what we can come up with. Thanks for the support.

Comment 33 Yaacov Zamir 2017-04-12 14:02:16 UTC
submitted upstream:
https://github.com/ManageIQ/manageiq-ui-classic/pull/1018

Comment 34 Yaacov Zamir 2017-04-13 03:33:49 UTC
merged upstream:
https://github.com/ManageIQ/manageiq-ui-classic/pull/1018

Comment 36 Pavel Zagalsky 2017-12-21 12:42:53 UTC
Verified on 5.9.0.12

