Bug 1386405

Summary: Hawkular-Metrics can excessively log errors.
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: Hawkular
Assignee: Matt Wringe <mwringe>
Status: CLOSED WONTFIX
QA Contact: Peng Li <penli>
Severity: low
Priority: medium
Version: 3.3.0
CC: aos-bugs, erjones
Target Milestone: ---
Keywords: Unconfirmed
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-10-05 18:47:19 UTC
Type: Bug

Description Eric Jones 2016-10-18 20:45:16 UTC
Description of problem:
The hawkular-metrics pod repeats the error below [0] over and over (dozens of times every second), and rather than signaling that the pod is unhealthy and needs to restart, it just sits there retrying indefinitely.

[0]
ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-4) HAWKMETRICS200010: Failed to process request: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
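
For reference, com.datastax.driver.core.exceptions.NoHostAvailableException is what the Datastax Java driver throws when none of its Cassandra contact points can be reached. Below is a minimal sketch that reproduces the condition; the contact point name is hypothetical, so substitute the Cassandra service host for your cluster:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class CassandraProbe {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("hawkular-cassandra") // hypothetical host name
                .build();
        try (Session session = cluster.connect()) {
            System.out.println("Connected to Cassandra");
        } catch (NoHostAvailableException e) {
            // Thrown when every contact point is unreachable -- the same
            // condition behind the repeated HAWKMETRICS200010 errors above.
            System.err.println("No Cassandra host reachable: " + e.getMessage());
        } finally {
            cluster.close();
        }
    }
}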

Additional Information:
The node the pod was deployed to ran out of Docker storage, which may introduce an additional factor.

Comment 1 Matt Wringe 2016-10-18 21:36:25 UTC
We should probably reduce the number of error messages being generated in this case. I created a JIRA for this here: https://issues.jboss.org/browse/HWKMETRICS-513
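
One common way to curb this kind of log flooding is to rate-limit repeats of the same error so that a sustained outage logs once per interval instead of dozens of times per second. The sketch below is illustrative only, not necessarily the fix tracked in that JIRA; the helper class and its use of System.err are hypothetical:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: suppresses repeats of the same error so a sustained
// Cassandra outage logs at most once per interval instead of once per request.
public class ThrottledErrorLogger {
    private final long intervalMillis;
    private final AtomicLong lastLogged = new AtomicLong(0);
    private final AtomicLong suppressed = new AtomicLong(0);

    public ThrottledErrorLogger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    public void error(String message, Throwable t) {
        long now = System.currentTimeMillis();
        long last = lastLogged.get();
        // Log only if the interval has elapsed and we win the CAS race;
        // otherwise just count the suppressed occurrence.
        if (now - last >= intervalMillis && lastLogged.compareAndSet(last, now)) {
            long skipped = suppressed.getAndSet(0);
            System.err.println(message + " (" + skipped + " similar errors suppressed): " + t);
        } else {
            suppressed.incrementAndGet();
        }
    }
}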

The Hawkular Metrics pod should not be restarted in this situation. The problem is that it cannot connect to the Cassandra instance, and as such restarting the Hawkular Metrics pod is not going to resolve that. When the Cassandra connection is valid again, Hawkular Metrics should just be able to reconnect and continue to function.
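
To illustrate the reconnect-rather-than-restart behaviour, here is a minimal sketch assuming the Datastax 3.x Java driver; this is not Hawkular Metrics' actual startup code:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class ReconnectExample {
    // Keep retrying with a capped exponential backoff until Cassandra accepts
    // connections, so the process can ride out an outage without a restart.
    static Session connectWithRetry(String contactPoint) throws InterruptedException {
        long backoffMillis = 1_000;
        while (true) {
            Cluster cluster = Cluster.builder().addContactPoint(contactPoint).build();
            try {
                return cluster.connect();
            } catch (NoHostAvailableException e) {
                cluster.close();
                System.err.println("Cassandra unavailable, retrying in " + backoffMillis + " ms");
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 30_000);
            }
        }
    }
}

Once an initial connection is established, the driver's own reconnection policy already handles transient outages; a loop like the one above only matters when Cassandra is down at startup.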

If this is not the case (e.g. restarting the Hawkular Metrics pod did fix this issue), then we will need to fix that situation.

When you say "the node the pod was deployed to ran out of Docker storage", are you talking about the Cassandra pod or the Hawkular Metrics pod?

Comment 2 Matt Wringe 2016-10-27 14:23:25 UTC
I am requesting more information about which pod ran out of storage.

The Hawkular Metrics pod not restarting is expected and desired behaviour; it should automatically reconnect once the Cassandra instance is available again.

The constant logging is a more pressing concern that needs to be looked into.

Comment 3 Matt Wringe 2016-10-31 20:23:27 UTC
Lowering the priority of this issue, as it's more of a problem with logging in specific error conditions than with reduced functionality. It is still something we need to handle in a better fashion, though.

Comment 4 Eric Jones 2016-11-10 16:41:08 UTC
My apologies for the delay. 

The storage I referenced was not the storage provided to the pod, but the Docker storage configured on the node that the Cassandra and Hawkular Metrics pods were running on.

Unfortunately the associated case is closed, but I will shortly attach the logs that were provided at the start of the case.

Comment 6 Matt Wringe 2017-02-09 20:27:40 UTC
Upstream tracking https://issues.jboss.org/browse/HWKMETRICS-513