Description of problem:
OCP metrics works on the overview page, but when trying to use ad-hoc metrics we get an error.

Version-Release number of selected component (if applicable):
5.8.0.9-alpha2.20170404195944_1d7ece4

How reproducible:
Another person got a 503 error against the same OCP cluster.

Steps to Reproduce:
1. Add an OCP provider with C&U
2. Verify on the overview page that the metrics have populated
3. Compute -> Containers -> Provider -> Monitoring -> Ad hoc Metrics

Actual results:
The metrics never load and a "504 Gateway Time-out. The server didn't respond in time" banner message pops up.

Expected results:
No errors; metrics populate.

Additional info:
5.8 test-a-thon
Created attachment 1269438 [details] error_screen_shot
Not sure there is really a problem here. How long does hawkular take to respond when you get the 504 error? Can you attach the timing of:

time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics/metrics

(maybe also a word count of the number of result rows [1])

HAWKULAR_HOST and HAWKULAR_TOKEN you can take from the provider definition.

[1] curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics/metrics | wc
Yaacov do we have a way to set the timeout on this specific flow?
@Mooli, the only thing I changed is "metrics/metrics" to "metrics", since that is the URL of the metrics endpoint.

[root@itewk-cfme58-test-1 ~]# time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics
{"Implementation-Version":"0.21.5.Final-redhat-1","Built-From-Git-SHA1":"632f908a52d3e45b3a0bafa84e117ec6ca87bb19","name":"Hawkular-Metrics"}
real 0m0.146s
user 0m0.051s
sys 0m0.060s

[root@itewk-cfme58-test-1 ~]# curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics | wc
0 1 141
[root@itewk-cfme58-test-1 ~]#
The /metrics endpoint is a status line; /metrics/metrics is one of the requests that the page makes. Also, please make sure you hit the failure you described while timing. Thanks!
see comment 4
hi all

@Ian
a. The ad-hoc page does not require C&U or OpenShift, only a Hawkular server.
b. Metrics on the overview page show that we have/had a working Hawkular server at some point, BUT we can have data on the ad-hoc metrics page without C&U metrics (if we did not set up C&U), and we can have metrics in ManageIQ without having metrics on the ad-hoc page (if the hawkular server stopped for some reason).
c. A better check is to verify that the hawkular server validates correctly in the edit container provider page.

@Mooli
a. Yes, sometimes the /metrics endpoint works while /metrics/metrics fails; it is important to check the /metrics/metrics endpoint.
b. Yes, we have a way to set the timeout:
b.1. the HAWKULARCLIENT_REST_TIMEOUT environment variable can be set
b.2. we can decide on a better default value and set it in the Ruby code, but we need to know what a reasonable default value is.

----------

@Ian,
1. Please check that the hawkular server still validates in the edit container provider page (we may have an unrelated bug).
2. Please check the /metrics/metrics endpoint with time so we can get an estimate for a reasonable timeout default value.

p.s. I'm on PTO until Wednesday ...
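To make b.1 concrete, here is a minimal shell sketch of reading the HAWKULARCLIENT_REST_TIMEOUT override. The 30-second fallback is an assumption for illustration only; the client's actual default is not confirmed in this thread.

```shell
# Hypothetical helper: return the Hawkular REST timeout in seconds.
# Uses HAWKULARCLIENT_REST_TIMEOUT when set; the 30s fallback is an
# assumed value, not the confirmed client default.
hawkular_timeout() {
  echo "${HAWKULARCLIENT_REST_TIMEOUT:-30}"
}
```

Exported before starting the server (e.g. `export HAWKULARCLIENT_REST_TIMEOUT=60`), the variable would override the fallback.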
p.s. On my machine it takes 3s before hawkular answers, with no error in the page:

[yzamir@localhost ~]$ time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics/metrics
[{"id":"machine/yzamir-centos7-2.eng.lab.tlv.redhat.com/network/rx","tags":{"descriptor_name":"network/rx","group_id":"/network/rx","host_id":"yzamir-centos7-2.eng.lab.tlv.redhat.com","hostname":"yzamir-centos7-2.eng.l ...
...
... group_id":"docker-daemon/memory/major_page_faults_rate","host_id":"yzamir-centos7-2.eng.lab.tlv.redhat.com","hostname":"yzamir-centos7-2.eng.lab.tlv.redhat.com","nodename":"yzamir-centos7-2.eng.lab.tlv.redhat.com","type":"sys_container"},"dataRetention":7,"type":"gauge","tenantId":"_system","minTimestamp":1491134400000,"maxTimestamp":1491748950000}]
real 0m3.045s
user 0m0.031s
sys 0m0.080s
[yzamir@localhost ~]$
Nothing responds at metrics/metrics, which is why I tried metrics:

[root@itewk-cfme58-test-1 ~]# time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics/metrics
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
real 0m30.185s
user 0m0.044s
sys 0m0.082s

But the metrics work fine within OCP and the aggregated metrics work within CFME, so why wouldn't the ad-hoc page work? Though I don't really understand how the ad-hoc page is supposed to work.
> But the metrics work fine within OCP and the aggregated metrics work within CFME

@Ian, no, C&U cannot work if /metrics/metrics times out; you are probably seeing metrics collected an hour or a day ago. Look for errors in the logs.

> so why wouldn't the ad-hoc work? Though I don't really understand how the ad-hoc is supposed to work.

Ad hoc metrics does not use the metrics collected by the C&U system; it needs a working hawkular server, and yours is timing out (30s!).

> [root@itewk-cfme58-test-1 ~]# time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN"
> https://$HAWKULAR_HOST/hawkular/metrics/metrics
> <html><body><h1>504 Gateway Time-out</h1>

This is not a bug in the ad-hoc page; it's a bug in hawkular or in the hawkular setup.
@Yaacov, I respectfully disagree. OCP is reporting live streaming metrics without issue, and it depends on the same hawkular instance. Why would OCP be able to display the metrics if CFME can't? That seems like an issue with CFME to me, not hawkular. Attaching a screen shot of metrics working in OCP.
Created attachment 1270267 [details] live streaming metrics working in OCP
Created attachment 1270268 [details] live metrics working in OCP
@Ian, we cannot research this bug while Hawkular does not behave as expected. If hawkular times out with a "504 Gateway Time-out" error, the best we can do is show that message to the user.

Please fix the hawkular server to behave as expected [1]: /metrics/metrics should return a list of metrics. Once hawkular is working according to spec, we can continue with this bug.

[1] http://www.hawkular.org/docs/rest/rest-metrics.html#_metric
> OCP is reporting live streaming metrics without issue and it depends on the same hawkular instance. Why would OCP be able to display the metrics if CFME can't? That seems like an issue with CFME to me, not hawkular.

The OpenShift UI runs inside OpenShift; it may use a different IP, or a different routing/proxy path, from ManageIQ. The OpenShift UI may also have different credentials than ManageIQ.

1. The same curl from within the OpenShift master may work; this would tell us that the routing is bad.
2. The same curl using username/password may work; this would tell us that the token we are using does not have the right credentials to read metrics.

-------------

a. Did you check that the hawkular server re-validates correctly?
b. Did you check for errors in the C&U logs? (It uses the same route and credentials, so it should fail too.)
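Checks 1 and 2 above can be sketched as shell functions. This is only an illustration: the host, token, and user:pass arguments are placeholders you supply, and --max-time 35 is an arbitrary cap chosen so a hung gateway fails after ~35s instead of hanging the shell.

```shell
# Probe /hawkular/metrics/metrics and print only the HTTP status code.
# A healthy server should print 200; the failing setup above prints 504
# (or nothing useful before --max-time expires).
probe_with_token() {  # usage: probe_with_token HOST TOKEN
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 35 \
    -H 'Hawkular-Tenant: _system' \
    -H "Authorization: Bearer $2" \
    "https://$1/hawkular/metrics/metrics"
}
probe_with_password() {  # usage: probe_with_password HOST USER:PASS
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 35 \
    -H 'Hawkular-Tenant: _system' \
    -u "$2" \
    "https://$1/hawkular/metrics/metrics"
}
```

Running probe_with_token from the master and probe_with_password from the ManageIQ appliance separates the routing question from the credentials question.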
(In reply to Yaacov Zamir from comment #16)
> The Openshift UI runs inside the Openshift, it may use different ip, or
> routing - proxy from ManageIQ.
> The Openshift UI may have different credentials than ManageIQ.
>
> 1. the same curl from within the openshift master may work - this will tell
> us that the routing is bad.
> 2. the same curl using usewrname/password may work, this will tell us that
> the token we are using does not have the right credentials to read metrics.
>
> a. did you check that the hawkular server re-validates currectly ?
> b. did you check for errors in the C&U logs ( it uses the same route and
> credentials, so it should fail too )

As stated in my original bug report, metrics collection in CFME is working fine in the standard C&U UI; I get metrics for the pods there. It is only on the ad-hoc metrics page that the metrics do not come through. Therefore the metrics credentials verify without issue; otherwise we wouldn't be getting the pod-level metrics in the C&U UI.
Created attachment 1270271 [details] OCP provider validation
Created attachment 1270272 [details] OCP metrics validation
Created attachment 1270273 [details] CFME OCP Pod C&U Timeline working
Is there a way to move this over to the OCP team?
Created attachment 1270274 [details] CFME OCP provider overview with aggregated metrics
Created attachment 1270275 [details] CFME OCP Pod C&U Timeline working
> Is there a way to move this over to the OCP team?

We can open a new bug on OpenShift about:

[root@itewk-cfme58-test-1 ~]# time curl -k -s -H "Hawkular-Tenant: _system" -H "Authorization: Bearer $HAWKULAR_TOKEN" https://$HAWKULAR_HOST/hawkular/metrics/metrics
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
real 0m30.185s
user 0m0.044s
sys 0m0.082s

If the URL and credentials are correct, OpenShift with Hawkular working correctly should return a list of metrics, without an error.
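If the new OpenShift bug needs a scripted reproducer, a small check can tell the proxy's 504 page apart from the JSON metric list Hawkular should return. This is a sketch that matches only on the body text shown above:

```shell
# Return success (0) when the response body looks like the gateway's
# 504 error page rather than Hawkular's JSON metric list.
is_gateway_timeout() {
  case "$1" in
    *"504 Gateway Time-out"*) return 0 ;;
    *) return 1 ;;
  esac
}
```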
(In reply to Ian Tewksbury from comment #24)
> Is there a way to move this over to the OCP team?

Unfortunately no; flags and defaults get messed up. I created https://bugzilla.redhat.com/show_bug.cgi?id=1440548. Ian, please help the hawkular team figure it out; we need to get to the bottom of this.
@Mooli, will do. I will track this down with the OCP hawkular folks and see what we can come up with. Thanks for the support.
submitted upstream: https://github.com/ManageIQ/manageiq-ui-classic/pull/1018
merged upstream: https://github.com/ManageIQ/manageiq-ui-classic/pull/1018
Verified on 5.9.0.12