Bug 1473803 - Cassandra and hawkular-metrics appear stable but heapster cannot connect to hawkular-metrics
Status: CLOSED INSUFFICIENT_DATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.5.0
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Matt Wringe
QA Contact: Junqi Zhao
Keywords: Reopened, Unconfirmed
Depends On:
Blocks:
 
Reported: 2017-07-21 13:30 EDT by Eric Jones
Modified: 2017-07-26 18:00 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-26 18:00:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Eric Jones 2017-07-21 13:30:36 EDT
Description of problem:
After running the upgrade playbooks to install some patches to the underlying OCP cluster, metrics appeared to be broken. After deleting everything related to metrics and deploying it from scratch, heapster is unable to connect to hawkular-metrics:

Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 28. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 28]. Retrying.

Version-Release number of selected component (if applicable):
3.5

Additional Information:

adding pod logs and output of oc get pods -o yaml shortly
Comment 2 Matt Wringe 2017-07-24 12:26:06 EDT
A curl exit code of 28 means that it timed out. 
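
For reference, a minimal sketch of how the two log lines above could be decoded. The function name `explain_check_line` and the table are illustrative and not part of Heapster; the meanings are a small subset of curl's documented exit codes. An HTTP status of 000 typically means no HTTP response was received at all.

```python
import re

# Subset of curl's documented exit codes that matter when diagnosing this
# kind of failure. Anything not listed should be looked up in the curl docs.
CURL_EXIT_CODES = {
    6: "could not resolve host (DNS)",
    7: "failed to connect to host",
    28: "operation timed out",
    35: "SSL/TLS connect error",
    60: "peer certificate could not be verified",
}


def explain_check_line(line):
    """Extract the curl exit code and HTTP status from one endpoint-check log line.

    Returns None when the line does not mention a curl exit code.
    """
    exit_match = re.search(r"curl exit code:?\s*(\d+)", line, re.IGNORECASE)
    if exit_match is None:
        return None
    status_match = re.search(r"status code:?\s*(\d+)", line, re.IGNORECASE)
    code = int(exit_match.group(1))
    return {
        "curl_exit_code": code,
        "http_status": status_match.group(1) if status_match else None,
        "meaning": CURL_EXIT_CODES.get(code, "unknown (see curl documentation)"),
    }
```

The distinction is useful here because exit code 6 would point at DNS, while 28 says the name resolved (or the lookup itself stalled) and the request then ran out of time.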

Is the user able to connect to the Hawkular Metrics endpoint that the console would normally use? Can they manually access that endpoint?

The Hawkular Metrics pod is in the ready state, and its url should be resolvable.


The Hawkular Metrics pod cannot get into the ready state or remain running if that endpoint is not accessible. So that endpoint should be up and running; it's just that Heapster cannot access it.

Can you please have them check if there is anything in the network which would be preventing Heapster from accessing Hawkular Metrics? Such as a firewall between nodes or a problem with the DNS server.
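
One way to separate a DNS problem from a blocked or dropped connection is a small probe run from the node or pod where Heapster is scheduled. This is only a sketch: the function name `probe` and its return labels are made up for illustration, and the host and port are taken from the URL in the log above.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt.

    Returns one of: 'dns_failure', 'timeout', 'refused', 'error', 'ok'.
    """
    try:
        # Resolve first so a DNS failure is reported separately.
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    family, socktype, proto, _, sockaddr = addrinfo[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except socket.timeout:
        # Packets silently dropped, e.g. by a firewall, usually end up here.
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
    finally:
        sock.close()
    return "ok"
```

Calling `probe("hawkular-metrics", 443)` from inside the cluster would be the closest match to what the endpoint check does. A `timeout` result would be consistent with the curl exit code 28 reported above, whereas `dns_failure` would point at name resolution instead.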
Comment 3 Taneem Ibrahim 2017-07-24 12:50:02 EDT
Hi Matt,

Yes, the customer can view the metrics "about" page fine at https://$HAWKULAR_METRICS_HOSTNAME/hawkular/metrics .

I will follow up on the firewall question with them. I thought about that earlier too.

Thank you for looking into this!

Cheers,
Taneem
Comment 5 Taneem Ibrahim 2017-07-24 17:11:04 EDT
Matt,

Customer informed that they do not see any issue in their firewall or DNS.

--Taneem
Comment 6 Matt Wringe 2017-07-24 17:22:04 EDT
Sorry, I should have read the logs more. What is the problem here?

Heapster checks whether it can access Hawkular Metrics before entering the 'ready' state. It will output messages about checking this endpoint:

Endpoint Check in effect. Checking https://hawkular-metrics:443/hawkular/metrics/status
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 28. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 28]. Retrying.
....

If Heapster does not reach the Hawkular Metrics endpoint within a certain time limit, it will be restarted. I suspect what is happening here is that it took a while for Hawkular Metrics and Cassandra to download the new images and restart. This caused Heapster to restart itself (as expected and as designed).

From the attached logs, I can see that Heapster has been able to connect to Hawkular Metrics.
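
The check-then-restart behaviour described above can be sketched roughly as follows. This is not Heapster's actual startup script: the function name, the 120-second limit and the 5-second interval are placeholder assumptions, since the real values are not stated in this bug.

```python
import time


def wait_for_endpoint(check, time_limit=120.0, interval=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or time_limit seconds elapse.

    Returns True once the endpoint is reachable. Returns False when the
    limit is hit, which is the point where the container would exit and
    be restarted by the platform.
    """
    deadline = clock() + time_limit
    while True:
        if check():
            return True
        # Give up if another wait would take us past the deadline.
        if clock() + interval > deadline:
            return False
        print("Endpoint is not accessible. Retrying.")
        sleep(interval)
```

With a loop like this, a Hawkular Metrics pod that needs longer than the limit to come up leads to one or more Heapster restarts, after which the check succeeds on its own.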
Comment 7 Taneem Ibrahim 2017-07-26 11:11:03 EDT
Matt, 

Thank you for the explanation. However, in this case heapster actually never recovered on its own.  The cassandra and metrics pods were up within a few minutes, but heapster was in this error state for hours while killing and restarting pods over and over.  The 'resolution' was to scale it down to 0 and back up.

Is this expected?
Comment 8 Matt Wringe 2017-07-26 14:12:47 EDT
Heapster is stateless.

Scaling it down to zero and then back up again shouldn't have any more effect than the pod restarting itself.

The difference could be that with a restart the pod is not rescheduled to another node, while scaling it down and back up again could reschedule it.

If you had a networking problem, say a firewall preventing access between nodes, then this could explain why scaling down the pod and then back up could have worked.

Can you please have them check if they have any firewalls between their nodes?
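
To confirm whether the pod actually moved, the node name can be compared before and after the scale down and up. The sketch below assumes the input is the JSON printed by `oc get pod <name> -o json`, where the scheduled node is recorded in `spec.nodeName`; the helper names are illustrative.

```python
import json


def node_of(pod_json):
    """Return the node a pod is scheduled on, or None if it is not set."""
    pod = json.loads(pod_json)
    return pod.get("spec", {}).get("nodeName")


def was_rescheduled(before_json, after_json):
    """True when both pods report a node and the nodes differ."""
    before = node_of(before_json)
    after = node_of(after_json)
    return before is not None and after is not None and before != after
```

If the new Heapster pod landed on a different node and started working, that would support the theory of a node-to-node networking problem.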
Comment 9 Taneem Ibrahim 2017-07-26 16:55:05 EDT
Hi Matt,

No firewall between the nodes.
Comment 10 Matt Wringe 2017-07-26 18:00:16 EDT
This issue is not reproducible and there is not enough information to proceed. 

The main culprit here is a problem with the network setup in this cluster, whether that is a problem with the DNS server or a firewall preventing access.

I am closing this issue unless there is a reproducible step we can follow here.
