Description of problem:
When the Hawkular endpoint has been modified (e.g. the port changed) and CFME can no longer communicate with OpenShift for C&U data gathering, CFME does not indicate that it is no longer gathering data.

Version-Release number of selected component (if applicable):
5.6.0.10-rc2

How reproducible:
Always

Steps to Reproduce:
1. Define a Provider and do not define a specific Hawkular Endpoint
2. Wait (an hour at least) for CFME to gather C&U data
3. Ensure C&U data can be viewed in CFME
4. In OpenShift, change the Hawkular router endpoint
5. DO NOT modify any details in CFME (Hawkular endpoint tab)
6. Wait for CFME to refresh data

Actual results:
CFME shows (in Provider view) that "Last refresh was successful"

Expected results:
CFME should indicate that it could not communicate with OpenShift to gather C&U data.

Additional info:
We were supposed to have this information (Hawkular endpoint status) in the summary page of the provider where the status of the endpoints is reported.
Yaacov, can you look at this? I think that the Hawkular endpoint status in the summary page of the provider should resolve this (if so you can CLOSE CURRENTRELEASE).
> Hawkular endpoint status in the summary page of the provider should resolve this

The Hawkular endpoint status is not updated if the metrics collection fails.
(In reply to Yaacov Zamir from comment #4)
> > Hawkular endpoint status in the summary page of the provider should resolve this
>
> The Hawkular endpoint status is not updated if the metrics collection fails.

OK, so what's the multi-endpoint approach to report this? Should we raise a specific exception during metrics collection? How would the infrastructure know which endpoint to mark as failed? Are we missing infrastructure support here?
(In reply to Federico Simoncelli from comment #5)
> Are we missing infrastructure support here?

I think it is OK to use the Hawkular endpoint status (actually the authentication status...), but we need to update it when metrics collection fails. WDYT?
Submitted upstream: https://github.com/ManageIQ/manageiq/pull/9785
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7d3be77691df12277238cffd2933d7b29c6c0fb2

commit 7d3be77691df12277238cffd2933d7b29c6c0fb2
Author:     Yaacov Zamir <yzamir>
AuthorDate: Wed Jul 13 17:10:02 2016 +0300
Commit:     Yaacov Zamir <yzamir>
CommitDate: Sun Aug 14 10:29:35 2016 +0300

    No indication of failed data gathering from Hawkular endpoint

    When gathering hawkular metrics fails, user has no indication in the
    status ui page.

    This PR updates the status of the Hawkular endpoint each time metrics
    is collected. It also display the last update date and time as an html
    title for the endpoint status, so users can know when the status was
    last checked.

    Bugzzila
    https://bugzilla.redhat.com/show_bug.cgi?id=1344308

 .../ems_container_helper/textual_summary.rb        | 23 ++----------
 .../textual_authentications_status.rb              | 43 ++++++++++++++++++++++
 .../textual_mixins/textual_metrics_status.rb       | 19 ++++++++++
 .../container_manager/metrics_capture.rb           | 11 +++++-
 .../kubernetes/container_manager_mixin.rb          |  6 +++
 ..._add_metrics_status_to_ext_management_system.rb |  7 ++++
 db/schema.yml                                      |  3 ++
 7 files changed, 91 insertions(+), 21 deletions(-)
 create mode 100644 app/helpers/textual_mixins/textual_authentications_status.rb
 create mode 100644 app/helpers/textual_mixins/textual_metrics_status.rb
 create mode 100644 db/migrate/20160808150745_add_metrics_status_to_ext_management_system.rb
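For readers following along, the idea in the commit message (record a status on every metrics collection attempt, whether it succeeds or fails) can be sketched roughly as below. This is only an illustration under assumptions, not the code from the PR: `collect_with_status` and the hash keys `:state`, `:message` and `:checked_at` are hypothetical names chosen for this example.

```ruby
# Hypothetical sketch, not the ManageIQ implementation.
# Runs the given collection block and returns a status hash describing the
# outcome, so the caller can store it whether or not collection worked.
def collect_with_status(now: Time.now.utc)
  yield
  { state: :valid, message: nil, checked_at: now }
rescue StandardError => e
  # Any failure during collection is turned into an error status
  # instead of being lost, together with the time of the attempt.
  { state: :error, message: e.message, checked_at: now }
end

# Usage: one successful attempt and one failing attempt.
ok     = collect_with_status { :metrics_read }
failed = collect_with_status { raise IOError, "503 Service Unavailable" }
puts ok[:state]        # state of the successful attempt
puts failed[:message]  # error text captured from the failed attempt
```

Keeping the timestamp next to the state is what lets the summary page show when the status was last checked, as the commit message describes.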
Merged upstream: https://github.com/ManageIQ/manageiq/pull/9785
I can see the "Last Metrics Collection", but when the metrics collection fails, I don't see any error message about it. Only None in the "Last Metrics Collection".
(In reply to Jaroslav Henner from comment #11)
> I can see the "Last Metrics Collection", but when the metrics collection
> fails, I don't see any error message about it. Only None in the "Last
> Metrics Collection".

Jaroslav, by simply reading the description of the fix:
https://github.com/ManageIQ/manageiq/pull/9785
and looking at the screenshot, that's the expected behavior introduced. Also, the description of this BZ does not mention the display of an error. Please discuss this with Yaacov.
Created attachment 1213749 [details] an image of an error message about why metrics are not collected
> I can see the "Last Metrics Collection", but when the metrics collection fails, I don't see any error message about it. Only None in the "Last Metrics Collection".

The "None" is shown when no attempt to read metrics has been made. This can happen if no workers have the C&U roles enabled, or if the validation fails and no reading is made.

If an error happens (we do try to read metrics and fail), we get an error message:
https://bugzilla.redhat.com/attachment.cgi?id=1213749

What do you think we should write if no metrics reading attempt was made?

-------------------

How to reproduce the error message (for testing):
a. Create a provider and read some valid metrics.
b. See a valid reading message.
c. Shut down the metrics provider / disconnect the internet connection to the provider.
d. Wait for a metrics reading attempt (~15 minutes).
e. See an error message.
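To make the distinction above concrete ("None" when no attempt was made, a status with a timestamp otherwise), here is a small sketch of how such a label could be derived. It is an assumption-laden illustration, not the helper from the PR: `last_collection_label` is a hypothetical name, and the "State - N Minutes Ago" wording is only modelled on the summary page text quoted in this bug.

```ruby
# Hypothetical sketch, not the ManageIQ textual summary helper.
# status is nil when no collection attempt was ever made (for example,
# when the C&U roles are not enabled); otherwise it is a symbol such as
# :valid or :unavailable, and checked_at is the time of the last attempt.
def last_collection_label(status, checked_at = nil, now: Time.now)
  return "None" if status.nil?

  minutes = ((now - checked_at) / 60).floor
  # Simplified age formatting: no singular forms, nothing beyond hours.
  age = minutes < 60 ? "#{minutes} Minutes Ago" : "About #{minutes / 60} Hours Ago"
  "#{status.to_s.capitalize} - #{age}"
end

# Usage with a fixed reference time so the result does not depend on the clock.
now = Time.at(100_000)
puts last_collection_label(nil)
puts last_collection_label(:unavailable, now - 2 * 3600, now: now)
```

The point of the sketch is that "None" carries no timestamp at all, which is why it cannot tell the user whether collection is failing or was simply never attempted.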
(In reply to Yaacov Zamir from comment #15)
> > I can see the "Last Metrics Collection", but when the metrics collection fails, I don't see any error message about it. Only None in the "Last Metrics Collection".
>
> The "None" is for when no attempt to read metrics is made, this can happen
> if no worker roles are set for C&U roles, or the validation fails and no
> reading is made.
>
> If an error happens ( we do try to read metrics and fail )we get an error
> message:
> https://bugzilla.redhat.com/attachment.cgi?id=1213749
>
> What do you think we should write if no metrics reading attempt was made ?
>
> -------------------
>
> How to reproduce the error message (for testing):
> a. create a provider and read some valid metrics.
> b. see a valid reading message.
> c. shut down the metrics provider / disconnect internet connection to
> provider.
> d. wait for metrics reading attempt (~15 minutes)
> e. see an error message.

Well, there must be something wrong with the metrics collection. When I clicked the verification in the Hawkular tab, it didn't complain. This is what I can see on the provider Summary page:

Endpoints
  Hawkular Host Name         ose3-master-ki4mb
  Hawkular API Port          443

Status
  Bearer Authentication      Valid - 24 Minutes Ago
  Hawkular Authentication    Valid - 24 Minutes Ago
  Last Metrics Collection    None
  Last Refresh               Success - 7 Minutes Ago

As you can see, I added the metrics endpoint 24 minutes ago, but I got no update and I can't see any error message. I added it using only part of the domain name (the rest is in the search list in resolv.conf). If I use curl from the CFME instance, it works:

# curl -k https://ose3-master-ki4mb:443
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta http-equiv="refresh" content="0;hawkular/metrics">
    <title>Hawkular Metrics</title>
  </head>
  <body>
    <h1>Hawkular Metrics</h1>
    <h3>A time series metrics engine based on Cassandra</h3>
    <p>0.8.2</p>
  </body>
</html>

I can't see anything suspicious in the evm.log.
OK, I found the issue. I had to enable the collection at https://cfme/ops/explorer by turning on the Capacity & Utilization * switches.
> OK, I found the issue. I had to enable the On Off Capacity & Utilization * switches

Great :-) Closing; no metrics are collected for zones that do not have C&U roles :-)
I removed the route to metrics from OpenShift, waited some time, and now I can see:

Status
  Bearer Authentication      Error - About 2 Hours Ago
  Default Authentication     Valid - 7 Days Ago
  Hawkular Authentication    Valid - 1 Day Ago
  Last Metrics Collection    Unavailable - About 2 Hours Ago
  Last Refresh               Error - About 2 Hours Ago
  503 "Service Unavailable"

Which is quite fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0012.html