Bug 1344308 - C&U - No indication of failed data gathering if Hawkular endpoint has been modified in Openshift
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: GA
Target Release: 5.7.0
Assignee: Yaacov Zamir
QA Contact: Jaroslav Henner
URL:
Whiteboard: container:c&u
Depends On:
Blocks: 1356130
 
Reported: 2016-06-09 11:49 UTC by Einat Pacifici
Modified: 2017-05-08 14:58 UTC (History)

Fixed In Version: 5.7.0.0
Doc Text:
Clone Of:
: 1356130 (view as bug list)
Environment:
Last Closed: 2017-01-04 12:55:42 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
an image of an error message about why metrics are not collected (46.02 KB, image/png)
2016-10-25 08:28 UTC, Yaacov Zamir


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0012 0 normal SHIPPED_LIVE CFME 5.7.0 bug fixes and enhancement update 2017-01-04 17:50:36 UTC

Description Einat Pacifici 2016-06-09 11:49:04 UTC
Description of problem:
When the Hawkular endpoint has been modified (e.g. its port changed) and CFME can no longer communicate with OpenShift for C&U data gathering, CFME does not indicate that it is no longer gathering data.


Version-Release number of selected component (if applicable):
5.6.0.10-rc2

How reproducible:
Always

Steps to Reproduce:
1. Define a provider and do not define a specific Hawkular endpoint.
2. Wait (at least an hour) for CFME to gather C&U data.
3. Ensure C&U data can be viewed in CFME.
4. In OpenShift, change the Hawkular router endpoint.
5. DO NOT modify any details in CFME (Hawkular endpoint tab).
6. Wait for CFME to refresh data.

Actual results:
CFME shows (in Provider view) that "Last refresh was successful"

Expected results:
CFME should indicate that it could not communicate with Openshift to gather C&U data. 

Additional info:

Comment 2 Federico Simoncelli 2016-07-02 15:29:14 UTC
We were supposed to have this information (Hawkular endpoint status) in the summary page of the provider where the status of the endpoints is reported.

Comment 3 Federico Simoncelli 2016-07-13 07:57:11 UTC
Yaacov, can you look at this? I think that the Hawkular endpoint status in the summary page of the provider should resolve this (if so you can CLOSE CURRENTRELEASE).

Comment 4 Yaacov Zamir 2016-07-13 08:49:05 UTC
> Hawkular endpoint status in the summary page of the provider should resolve this

The Hawkular endpoint status is not updated if the metrics collection fails.

Comment 5 Federico Simoncelli 2016-07-13 10:05:12 UTC
(In reply to Yaacov Zamir from comment #4)
> > Hawkular endpoint status in the summary page of the provider should resolve this
> 
> The Hawkular endpoint status is not updated if the metrics collection fails.

OK, so what's the multi-endpoint approach to reporting this? Should we raise a specific exception during metrics collection?
How would the infrastructure know which endpoint to mark as failed?
Are we missing infrastructure support here?
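
A minimal sketch of the "specific exception" idea, in plain Ruby with no ManageIQ dependencies. The names used here (MetricsCollectionError, EndpointStatusBoard, collect_with_status) are hypothetical and are not taken from the ManageIQ code base; the sketch only illustrates that an exception carrying the endpoint role would let the caller know which endpoint to mark as failed.

```ruby
# Hypothetical error type: carries the role of the endpoint that failed.
class MetricsCollectionError < StandardError
  attr_reader :endpoint_role

  def initialize(endpoint_role, message)
    super(message)
    @endpoint_role = endpoint_role
  end
end

# Hypothetical per-provider status store, keyed by endpoint role.
class EndpointStatusBoard
  def initialize
    @statuses = {}
  end

  def mark(role, status, details = nil)
    @statuses[role] = {:status => status, :details => details, :checked_at => Time.now.utc}
  end

  # Returns :none for an endpoint that was never checked.
  def status_of(role)
    @statuses.fetch(role, {:status => :none})[:status]
  end

  def details_of(role)
    @statuses.fetch(role, {})[:details]
  end
end

# Runs the collection block. On success the :hawkular endpoint is marked
# valid; on failure the endpoint named by the exception is marked as error.
def collect_with_status(board)
  result = yield
  board.mark(:hawkular, :valid)
  result
rescue MetricsCollectionError => e
  board.mark(e.endpoint_role, :error, e.message)
  nil
end

board = EndpointStatusBoard.new
collect_with_status(board) { raise MetricsCollectionError.new(:hawkular, '503 "Service Unavailable"') }
puts board.status_of(:hawkular) # prints "error"
```

With this shape the collection code does not need to know how statuses are stored; it only has to name the endpoint role when it raises.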

Comment 6 Yaacov Zamir 2016-07-13 11:52:38 UTC
(In reply to Federico Simoncelli from comment #5)
> Are we missing infrastructure support here?

I think it is OK to use the Hawkular endpoint status (actually the authentication status...), but we need to update it when metrics collection fails. WDYT?

Comment 8 Yaacov Zamir 2016-07-13 14:18:42 UTC
Submitted upstream:
https://github.com/ManageIQ/manageiq/pull/9785

Comment 9 CFME Bot 2016-08-15 15:15:56 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7d3be77691df12277238cffd2933d7b29c6c0fb2

commit 7d3be77691df12277238cffd2933d7b29c6c0fb2
Author:     Yaacov Zamir <yzamir>
AuthorDate: Wed Jul 13 17:10:02 2016 +0300
Commit:     Yaacov Zamir <yzamir>
CommitDate: Sun Aug 14 10:29:35 2016 +0300

    No indication of failed data gathering from Hawkular endpoint
    
    When gathering Hawkular metrics fails, the user has no indication in the status UI page.
    
    This PR updates the status of the Hawkular endpoint each time metrics are collected.
    It also displays the last update date and time as an HTML title for the endpoint status, so users can know when the
    status was last checked.
    
    Bugzilla
    https://bugzilla.redhat.com/show_bug.cgi?id=1344308

 .../ems_container_helper/textual_summary.rb        | 23 ++----------
 .../textual_authentications_status.rb              | 43 ++++++++++++++++++++++
 .../textual_mixins/textual_metrics_status.rb       | 19 ++++++++++
 .../container_manager/metrics_capture.rb           | 11 +++++-
 .../kubernetes/container_manager_mixin.rb          |  6 +++
 ..._add_metrics_status_to_ext_management_system.rb |  7 ++++
 db/schema.yml                                      |  3 ++
 7 files changed, 91 insertions(+), 21 deletions(-)
 create mode 100644 app/helpers/textual_mixins/textual_authentications_status.rb
 create mode 100644 app/helpers/textual_mixins/textual_metrics_status.rb
 create mode 100644 db/migrate/20160808150745_add_metrics_status_to_ext_management_system.rb
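
The behaviour described in the commit message could look roughly like the sketch below: every collection attempt, successful or not, stamps the provider with the time of the check and with the last error. This is not the code from the PR; ProviderMetricsStatus, its attributes and capture_metrics are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the provider record that stores the metrics status.
ProviderMetricsStatus = Struct.new(:last_update, :last_success, :last_error)

# Runs one collection attempt and always records when it happened.
# The block stands in for the real call to the Hawkular endpoint.
def capture_metrics(status, now = Time.now.utc)
  data = yield
  status.last_success = now
  status.last_error = nil
  data
rescue StandardError => e
  status.last_error = e.message
  nil
ensure
  # Recorded on success and on failure, so the UI can show
  # when the status was last checked.
  status.last_update = now
end

status = ProviderMetricsStatus.new
capture_metrics(status) { raise IOError, "connection refused" }
puts status.last_error # prints "connection refused"
```

Keeping the timestamp update in the ensure clause is what makes the "last checked" time meaningful even when every attempt fails.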

Comment 10 Yaacov Zamir 2016-08-15 15:20:47 UTC
Merged upstream:
https://github.com/ManageIQ/manageiq/pull/9785

Comment 11 Jaroslav Henner 2016-10-18 18:41:41 UTC
I can see the "Last Metrics Collection" row, but when the metrics collection fails, I don't see any error message about it, only "None" in "Last Metrics Collection".

Comment 12 Federico Simoncelli 2016-10-20 21:25:32 UTC
(In reply to Jaroslav Henner from comment #11)
> I can see the  "Last Metrics Collection", but when the metrics collection
> fails, I don't see any error message about it. Only None in the "Last
> Metrics Collection".

Jaroslav, by simply reading the description of the fix:

https://github.com/ManageIQ/manageiq/pull/9785

and looking at the screenshot, that's the expected behavior introduced.

Also, the description of this BZ does not mention the display of an error.

Please discuss this with Yaacov.

Comment 14 Yaacov Zamir 2016-10-25 08:28:11 UTC
Created attachment 1213749
an image of an error message about why metrics are not collected

Comment 15 Yaacov Zamir 2016-10-25 08:37:50 UTC
> I can see the "Last Metrics Collection", but when the metrics collection fails, I don't see any error message about it. Only None in the "Last Metrics Collection".

"None" is shown when no attempt to read metrics was made. This can happen if no C&U worker roles are set, or if the validation fails and no reading is made.

If an error happens (we do try to read metrics and fail), we get an error message:
https://bugzilla.redhat.com/attachment.cgi?id=1213749

What do you think we should write if no metrics reading attempt was made?

-------------------

How to reproduce the error message (for testing):
a. create a provider and read some valid metrics.
b. see a valid reading message.
c. shut down the metrics provider / disconnect internet connection to provider.
d. wait for metrics reading attempt (~15 minutes)
e. see an error message.
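
The distinction drawn above, "None" when no attempt was made versus an error when an attempt failed, can be sketched as a small function. The function name, its arguments and the exact strings are illustrative assumptions loosely modelled on the summary page text quoted in this bug; this is not the helper from the fix.

```ruby
# Returns the text for a "Last Metrics Collection" row.
#   last_update: Time of the last collection attempt, or nil if none was made
#   last_error:  error message of the last attempt, or nil if it succeeded
def metrics_status_text(last_update, last_error, now = Time.now.utc)
  # No attempt at all (e.g. no C&U roles enabled): nothing to report.
  return "None" if last_update.nil?

  minutes = ((now - last_update) / 60).floor
  state = last_error.nil? ? "Valid" : "Unavailable"
  text = "#{state} - #{minutes} Minutes Ago"
  last_error.nil? ? text : "#{text} (#{last_error})"
end

now = Time.utc(2016, 10, 25, 13, 0, 0)
puts metrics_status_text(nil, nil, now)           # prints "None"
puts metrics_status_text(now - 15 * 60, nil, now) # prints "Valid - 15 Minutes Ago"
```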

Comment 16 Jaroslav Henner 2016-10-25 13:16:28 UTC
(In reply to Yaacov Zamir from comment #15)
> > I can see the  "Last Metrics Collection", but when the metrics collection fails, I don't see any error message about it. Only None in the "Last Metrics Collection".
> 
> The "None" is for when no attempt to read metrics is made, this can happen
> if no worker roles are set for C&U roles, or the validation fails and no
> reading is made.
> 
> If an error happens (we do try to read metrics and fail), we get an error
> message:
> https://bugzilla.redhat.com/attachment.cgi?id=1213749
> 
> What do you think we should write if no metrics reading attempt was made ?
> 
> -------------------
> 
> How to reproduce the error message (for testing):
> a. create a provider and read some valid metrics.
> b. see a valid reading message.
> c. shut down the metrics provider / disconnect internet connection to
> provider.
> d. wait for metrics reading attempt (~15 minutes)
> e. see an error message.

Well, there must be something wrong with the metrics collection. When I clicked the verification in the Hawkular tab, it didn't complain. This is what I can see on the provider Summary page:

Endpoints
Hawkular Host Name 	ose3-master-ki4mb
Hawkular API Port 	443 

Status
Bearer Authentication 	Valid - 24 Minutes Ago
Hawkular Authentication 	Valid - 24 Minutes Ago
Last Metrics Collection 	None
Last Refresh 	Success - 7 Minutes Ago


As you can see, I added the metrics endpoint 24 minutes ago, but I got no update and I can't see any error message.

I added it using only part of the domain name (the rest is in the search list in resolv.conf). If I use curl from the CFME instance, it works:

# curl -k https://ose3-master-ki4mb:443
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta http-equiv="refresh" content="0;hawkular/metrics">
  <title>Hawkular Metrics</title>
</head>

<body>
  <h1>Hawkular Metrics</h1>
  <h3>A time series metrics engine based on Cassandra</h3>
  <p>0.8.2</p>
</body>
</html>

I can't see anything suspicious in the evm.log.
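
For reference, the same reachability check can be sketched in Ruby with the standard library instead of curl. The helper names hawkular_uri and probe_hawkular are made up for this example, and the default path is simply the redirect target seen in the curl output above; neither is taken from CFME.

```ruby
require "net/http"
require "openssl"
require "uri"

# Builds the URL to probe. The default path is the redirect target from
# the curl output; adjust it for your deployment.
def hawkular_uri(host, port = 443, path = "/hawkular/metrics")
  URI::HTTPS.build(:host => host, :port => port, :path => path)
end

# Performs one GET without certificate verification (like `curl -k`).
# Returns [:responded, http_code] when the server answered with any status,
# or [:unreachable, error_message] when no HTTP response was obtained.
def probe_hawkular(uri, timeout = 5)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  http.open_timeout = timeout
  http.read_timeout = timeout
  response = http.get(uri.request_uri)
  [:responded, response.code]
rescue StandardError => e
  [:unreachable, e.message]
end
```

Calling probe_hawkular(hawkular_uri("ose3-master-ki4mb")) from the CFME appliance would then separate "the endpoint answers" from "the endpoint cannot be reached". Note that a reachable endpoint only rules out connectivity problems; it says nothing about whether C&U collection is enabled.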

Comment 18 Jaroslav Henner 2016-10-25 14:09:22 UTC
OK, I found the issue. I had to enable the collection on https://cfme/ops/explorer

The On/Off "Capacity & Utilization *" switches.

Comment 19 Yaacov Zamir 2016-10-25 14:17:02 UTC
> OK, I found the issue. I had to enable the On Off Capacity & Utilization * switches

Great :-)

Closing, no metrics are collected for zones that do not have C&U roles :-)

Comment 20 Jaroslav Henner 2016-10-25 19:40:38 UTC
I removed the route to metrics from openshift, waited some time and now I can see

Status
Bearer Authentication 	Error - About 2 Hours Ago
Default Authentication 	Valid - 7 Days Ago
Hawkular Authentication 	Valid - 1 Day Ago
Last Metrics Collection 	Unavailable - About 2 Hours Ago
Last Refresh 	Error - About 2 Hours Ago
503 "Service Unavailable"

Which is quite fine.

Comment 22 errata-xmlrpc 2017-01-04 12:55:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0012.html

