Bug 1460350

Summary: [starter][starter-us-east-1] metrics failing to show in console
Product: OpenShift Container Platform
Component: Hawkular
Reporter: Steve Speicher <sspeiche>
Assignee: Matt Wringe <mwringe>
QA Contact: Liming Zhou <lizhou>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: unspecified
CC: aos-bugs, dmcphers, eparis, jcantril, mhalachev, rbaumgar, sspeiche
Keywords: OnlineStarter
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-10-17 13:40:19 UTC
Attachments (description, flags):
oc get pod heapster -o json (flags: none)
console empty metrics graph (flags: none)

Description Steve Speicher 2017-06-09 19:37:05 UTC
While using the web console (https://console.starter-us-east-1.openshift.com/console) and trying to view metrics for a pod, I see this error in the browser's console:

Response:

{"errorMsg":"Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.54.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)))"}

Response headers:

HTTP/1.1 500 Internal Server Error
X-Powered-By: Undertow/1
Access-Control-Allow-Headers: origin,accept,content-type,hawkular-tenant,authorization
Server: JBoss-EAP/7
Date: Fri, 09 Jun 2017 19:15:10 GMT
Access-Control-Allow-Origin: https://console.starter-us-east-1.openshift.com
Access-Control-Allow-Credentials: true
Content-Type: application/json
Content-Length: 296
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD
Access-Control-Max-Age: 259200
Set-Cookie: ebfa7d2b9ec400af3c79e2d068d9ce9b=14842d22ee85f2ac4e90aa05722bbee3; path=/; HttpOnly; Secure

Request headers:

POST /hawkular/metrics/metrics/stats/query HTTP/1.1
Host: metrics.starter-us-east-1.openshift.com
Connection: keep-alive
Content-Length: 150
Pragma: no-cache
Cache-Control: no-cache
Hawkular-Tenant: sspeiche1
Authorization: Bearer 
Origin: https://console.starter-us-east-1.openshift.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Content-Type: application/json
Accept: application/json
DNT: 1
Referer: https://console.starter-us-east-1.openshift.com/console/project/sspeiche1/overview
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
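
For readers unfamiliar with the Cassandra error above ("1 required but only 0 alive"), the following is a minimal sketch of how the required replica count follows from the consistency level. The function names and the replication factor are assumptions for illustration only; the actual keyspace settings of this cluster are not shown in this report.

```python
# Sketch: how many replicas must respond at common Cassandra consistency
# levels. Function names are illustrative, not part of any driver API.

def required_replicas(consistency: str, replication_factor: int) -> int:
    """Return the number of replicas that must respond for the given level.

    For the LOCAL_* levels, replication_factor is assumed to be the
    replication factor of the local datacenter.
    """
    level = consistency.upper()
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "TWO":
        return 2
    if level == "THREE":
        return 3
    if level in ("QUORUM", "LOCAL_QUORUM"):
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency}")


def can_serve(consistency: str, replication_factor: int, alive: int) -> bool:
    """True when enough replicas are alive to satisfy the consistency level."""
    return alive >= required_replicas(consistency, replication_factor)


# The situation in the error message: LOCAL_ONE needs 1 replica, 0 are alive.
# replication_factor=1 is an assumed value.
print(can_serve("LOCAL_ONE", replication_factor=1, alive=0))  # False
```

LOCAL_ONE is the weakest level that still needs a live replica, so this error means that the coordinator considered no replica for the requested data to be alive at the time of the query.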

Comment 1 Steve Speicher 2017-06-09 19:38:41 UTC
Per the discussion on IRC, it sounds like starter-us-east-1 is suffering from both AWS API rate limiting and etcd pressure.

Comment 2 Matt Wringe 2017-06-20 18:03:03 UTC
I can't seem to access starter-us-east-1 (I always get redirected to starter-us-east-2). Is this still an issue? Or was it resolved when the other issues in the cluster were fixed?

Comment 4 Steve Speicher 2017-07-19 01:26:09 UTC
Now that it has been upgraded to 3.6, I simply get a 503 error when hitting https://metrics.starter-us-east-1.openshift.com/hawkular/metrics

Comment 5 Matt Wringe 2017-07-19 15:41:48 UTC
The 503 error you are seeing for https://metrics.starter-us-east-1.openshift.com/hawkular/metrics is coming from the router, not Hawkular Metrics. This means either that the route is not properly configured or that the Hawkular Metrics pod is not in the running state.

Can you please check the state of the metrics pods in this cluster?
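
One rough way to tell the two cases apart is to look at the response headers. The sketch below is only a heuristic and rests on an assumption: responses that actually reach Hawkular Metrics carry the `Server: JBoss-EAP/7` and `X-Powered-By: Undertow/1` headers seen in the 500 response in the description, while an error page generated by the router does not. The function names are hypothetical.

```python
# Heuristic sketch: guess which component produced an HTTP response.
# Assumption: only Hawkular Metrics sets the JBoss-EAP / Undertow headers.

def reached_hawkular(headers: dict) -> bool:
    """True if the headers look like they were set by Hawkular Metrics."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "")
    powered_by = lowered.get("x-powered-by", "")
    return server.startswith("JBoss-EAP") or powered_by.startswith("Undertow")


def likely_origin(status: int, headers: dict) -> str:
    """Classify a response as coming from Hawkular, the router, or unknown."""
    if reached_hawkular(headers):
        return "hawkular-metrics"
    if status == 503:
        return "router"
    return "unknown"


# The 500 response from the description reached Hawkular Metrics.
print(likely_origin(500, {"Server": "JBoss-EAP/7", "X-Powered-By": "Undertow/1"}))
# A bare 503 without those headers most likely never reached the pod.
print(likely_origin(503, {"Content-Type": "text/html"}))
```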

Comment 6 Eric Paris 2017-07-21 16:19:50 UTC
The problem is not the router. It is the pods. I have no idea what is wrong with the pods:

# oc get pod -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-j5v0k   1/1       Running   0          2d
hawkular-cassandra-2-khd50   1/1       Running   0          2d
hawkular-metrics-rcwjh       0/1       Running   423        2d
heapster-0w9r6               0/1       Running   441        2d

# oc logs -n openshift-infra heapster-0w9r6 | tail
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.

I have no idea how to debug heapster itself.
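
As an illustration of what stands out in the listing above, here is a small sketch that parses the pasted `oc get pod` table and flags pods that are not ready or have restarted often. The function name and the restart threshold of 10 are arbitrary choices for this example, not OpenShift defaults.

```python
# Sketch: flag unhealthy pods in the `oc get pod` output pasted above.

OC_GET_POD_OUTPUT = """\
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-j5v0k   1/1       Running   0          2d
hawkular-cassandra-2-khd50   1/1       Running   0          2d
hawkular-metrics-rcwjh       0/1       Running   423        2d
heapster-0w9r6               0/1       Running   441        2d
"""


def unhealthy_pods(table: str, max_restarts: int = 10) -> list:
    """Return (name, ready, restarts) for pods that look unhealthy."""
    flagged = []
    lines = [line for line in table.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        name, ready, _status, restarts, _age = line.split()
        ready_count, total = (int(part) for part in ready.split("/"))
        if ready_count < total or int(restarts) > max_restarts:
            flagged.append((name, ready, int(restarts)))
    return flagged


for name, ready, restarts in unhealthy_pods(OC_GET_POD_OUTPUT):
    print(f"{name}: ready {ready}, {restarts} restarts")
# hawkular-metrics-rcwjh: ready 0/1, 423 restarts
# heapster-0w9r6: ready 0/1, 441 restarts
```

Note that both flagged pods report `Running` in the STATUS column; it is the READY and RESTARTS columns that show the problem.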

Comment 7 Eric Paris 2017-07-21 16:20:29 UTC
Created attachment 1302495 [details]
oc get pod heapster -o json

Comment 8 Steve Speicher 2017-07-25 00:39:48 UTC
Created attachment 1303937 [details]
console empty metrics graph

Comment 11 Matt Wringe 2017-07-25 14:04:07 UTC
(In reply to Eric Paris from comment #6)
> The problem is not the router. It is the pods. I have no idea what is wrong
> with the pods:
> 
> # oc get pod -n openshift-infra
> NAME                         READY     STATUS    RESTARTS   AGE
> hawkular-cassandra-1-j5v0k   1/1       Running   0          2d
> hawkular-cassandra-2-khd50   1/1       Running   0          2d
> hawkular-metrics-rcwjh       0/1       Running   423        2d
> heapster-0w9r6               0/1       Running   441        2d
> 

Who is monitoring this cluster? Obviously, if hawkular-metrics has failed 423 times over the past 2 days, something is really wrong. Why has it been restarting so many times?

> 
> I have no idea how to debug heapster itself.

Heapster is not the problem; it won't start until Hawkular Metrics has started. So we need to fix Hawkular Metrics first.
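
The dependency described here matches the retry messages in the Heapster log (curl exit code 7 means curl could not connect to the host). The following sketch shows the general shape of such a wait loop. It is not Heapster's actual startup code, which is not shown in this report; `wait_until_ready` and its `probe` argument are hypothetical names, and the probe is faked so that the example runs without a cluster.

```python
import time
from typing import Callable

# Status endpoint taken from the Heapster log in comment 6.
STATUS_URL = "https://hawkular-metrics:443/hawkular/metrics/status"


def wait_until_ready(probe: Callable[[], bool], attempts: int,
                     delay_seconds: float = 0.0) -> bool:
    """Call probe() up to `attempts` times; return True once it succeeds."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        print(f"'{STATUS_URL}' is not accessible (attempt {attempt}). Retrying.")
        time.sleep(delay_seconds)
    return False


# Fake probe for illustration: fails twice, then succeeds.
results = iter([False, False, True])
print(wait_until_ready(lambda: next(results), attempts=5))  # True
```

With a probe that never succeeds, as in this cluster, the loop exhausts its attempts and the dependent service never starts, which is consistent with Heapster staying at 0/1 ready.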

Comment 12 Robert Baumgartner 2017-08-10 08:06:27 UTC
I have the same issue on https://console.starter-us-west-2.openshift.com/console: no metrics data.
The log shows a 204 (no data) on the POST to https://metrics.starter-us-west-2.openshift.com/hawkular/metrics/metrics/stats/query with Body: {"tags":"descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:d513a826-7d3f-11e7-8490-0a69cdf75e6f","bucketDuration":"1mn","start":"-15mn"}
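
For anyone who wants to reproduce this query outside the browser, the sketch below builds the same POST request with Python's standard library. The URL path, body and header names are taken from this report; the tenant name and the token are placeholders, and the helper name `build_stats_query` is hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_stats_query(base_url: str, tenant: str, token: str,
                      pod_id: str) -> urllib.request.Request:
    """Build the stats query that the web console sends for a pod."""
    body = {
        "tags": ("descriptor_name:network/tx_rate|network/rx_rate,"
                 f"type:pod,pod_id:{pod_id}"),
        "bucketDuration": "1mn",
        "start": "-15mn",
    }
    return urllib.request.Request(
        url=f"{base_url}/hawkular/metrics/metrics/stats/query",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Hawkular-Tenant": tenant,          # the project name
            "Authorization": f"Bearer {token}",
        },
    )


request = build_stats_query(
    "https://metrics.starter-us-west-2.openshift.com",
    tenant="my-project",   # placeholder: the tenant is not given in the comment
    token="<token>",       # placeholder: supply a real bearer token
    pod_id="d513a826-7d3f-11e7-8490-0a69cdf75e6f",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a 204 response has an empty body, so there would be nothing to decode in that case.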

Comment 13 Matt Wringe 2017-08-10 13:56:47 UTC
(In reply to Robert Baumgartner from comment #12)
> I have same issue on
> https://console.starter-us-west-2.openshift.com/console, no metrics data...
> The log shows a 204(no data) on the POST to
> https://metrics.starter-us-west-2.openshift.com/hawkular/metrics/metrics/
> stats/query with Body:
> {"tags":"descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:
> d513a826-7d3f-11e7-8490-0a69cdf75e6f","bucketDuration":"1mn","start":"-15mn"}

Metrics failing to show up in the console is a very generic symptom that can be caused by any number of problems. Can you please open a new issue so that we can properly track it there? Otherwise we end up in a situation where one Bugzilla covers multiple issues and it's difficult to keep track of what is happening.

Comment 14 Robert Baumgartner 2017-09-04 11:41:13 UTC
(In reply to Matt Wringe from comment #13)
> (In reply to Robert Baumgartner from comment #12)
> > I have same issue on
> > https://console.starter-us-west-2.openshift.com/console, no metrics data...
> > The log shows a 204(no data) on the POST to
> > https://metrics.starter-us-west-2.openshift.com/hawkular/metrics/metrics/
> > stats/query with Body:
> > {"tags":"descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:
> > d513a826-7d3f-11e7-8490-0a69cdf75e6f","bucketDuration":"1mn","start":"-15mn"}
> 
> metrics failing to show up in the console is a very basic error condition
> for any number of problems. Can you please open a new issue so that we can
> properly track it there? Otherwise we end up with the situation where one
> bugzilla ends up covering multiple issues and its difficult to keep track of
> what is happening.

done, https://bugzilla.redhat.com/show_bug.cgi?id=1480261

Comment 15 Matt Wringe 2017-10-17 13:40:19 UTC
Closing as I think this is out of date. If we still have an issue here, please re-open.