Bug 1392105

Summary: Hawkular-Metrics having issues communicating with Cassandra
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: Hawkular
Assignee: Matt Wringe <mwringe>
Status: CLOSED WORKSFORME
QA Contact: Peng Li <penli>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.2.1
CC: aos-bugs, erjones, pweil
Target Milestone: ---
Keywords: Unconfirmed, UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-11 20:29:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Eric Jones 2016-11-04 20:08:37 UTC
Description of problem:
The customer identified the issue because Heapster was stuck in a CrashLoopBackOff state. Looking at the logs, Heapster points to Hawkular-Metrics, and there do not appear to be any issues with Cassandra (at least none evident in the logs).

The behavior persists after scaling the components down to 0 and then back up (Cassandra, then Hawkular-Metrics, then Heapster).
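For reference, the scale-down/scale-up cycle described above can be sketched as follows. The replication controller names and the openshift-infra namespace are assumptions based on the default metrics deployer; verify them with 'oc get rc -n openshift-infra' first.

```shell
# Scale the metrics components down to 0 (RC names assumed from the
# default metrics deployer).
oc scale rc hawkular-cassandra-1 --replicas=0 -n openshift-infra
oc scale rc hawkular-metrics --replicas=0 -n openshift-infra
oc scale rc heapster --replicas=0 -n openshift-infra

# Scale back up in dependency order: Cassandra, then Hawkular-Metrics,
# then Heapster.
oc scale rc hawkular-cassandra-1 --replicas=1 -n openshift-infra
oc scale rc hawkular-metrics --replicas=1 -n openshift-infra
oc scale rc heapster --replicas=1 -n openshift-infra
```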

Version-Release number of selected component (if applicable):
[root@<SYSTEM> ~]# openshift version
openshift v3.2.1.13-1-gc2a90e1
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

[root@<SYSTEM> ~]# oc get rc -o yaml |grep -i image
          image: registry.access.redhat.com/openshift3/metrics-cassandra:3.2.1
          imagePullPolicy: IfNotPresent
          image: registry.access.redhat.com/openshift3/metrics-hawkular-metrics:3.2.1
          imagePullPolicy: IfNotPresent
          image: registry.access.redhat.com/openshift3/metrics-heapster:3.2.1
          imagePullPolicy: IfNotPresent


Logs from before scaling down and from after scaling back up will be attached shortly.

Comment 2 Matt Wringe 2016-11-07 14:55:52 UTC
From the post_restart files, none of the logs indicate any errors.

The attached Heapster logs are from when Heapster is first starting up, so they have not yet reached any error state.

Can you please attach the logs for the failed Heapster pod? You can usually get these with 'oc logs -p $POD_NAME', where -p returns the logs from the previous (terminated) container instance, which in a CrashLoopBackOff is the one containing the error message.

If you cannot get the previous logs, please run 'oc logs -f $POD_NAME'; this 'follows' the logs as they are written and gathers the full output up until the pod is restarted.

Having access to the events from this time period would also be very helpful.
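The diagnostic commands requested above, collected in one place. The pod name is a placeholder, and the openshift-infra namespace is an assumption based on the default metrics deployment:

```shell
# Logs from the previously terminated container (the failed run).
oc logs -p $POD_NAME -n openshift-infra

# If the previous logs are unavailable, follow the current logs until
# the pod restarts, capturing the full output including the error.
oc logs -f $POD_NAME -n openshift-infra

# Recent events in the metrics namespace, which often show scheduling,
# image-pull, or probe failures that the pod logs do not.
oc get events -n openshift-infra
```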

Comment 6 Matt Wringe 2016-11-11 20:29:26 UTC
The user has reported that it is now working after fixing an iptables issue and restarting.

As this is not something we have been able to reproduce, and we cannot determine whether it was caused by something specific in OpenShift Metrics, closing as 'WORKSFORME'.