Bug 1460350 - [starter][starter-us-east-1] metrics failing to show in console [NEEDINFO]
Status: NEW
Product: OpenShift Online
Classification: Red Hat
Component: Metrics
Version: 3.x
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: high
Assigned To: Matt Wringe
Liming Zhou
: OnlineStarter
Reported: 2017-06-09 15:37 EDT by Steve Speicher
Modified: 2017-08-16 04:19 EDT
Type: Bug
mwringe: needinfo? (rbaumgar)


Attachments
oc get pod heapster -o json (9.30 KB, text/plain)
2017-07-21 12:20 EDT, Eric Paris
console empty metrics graph (51.70 KB, image/png)
2017-07-24 20:39 EDT, Steve Speicher
Description Steve Speicher 2017-06-09 15:37:05 EDT
In using the webconsole: https://console.starter-us-east-1.openshift.com/console

Trying to view metrics for a pod, I'm seeing this error in the browser's console:

Response:

{"errorMsg":"Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.54.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)))"}

Response headers:

HTTP/1.1 500 Internal Server Error
X-Powered-By: Undertow/1
Access-Control-Allow-Headers: origin,accept,content-type,hawkular-tenant,authorization
Server: JBoss-EAP/7
Date: Fri, 09 Jun 2017 19:15:10 GMT
Access-Control-Allow-Origin: https://console.starter-us-east-1.openshift.com
Access-Control-Allow-Credentials: true
Content-Type: application/json
Content-Length: 296
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD
Access-Control-Max-Age: 259200
Set-Cookie: ebfa7d2b9ec400af3c79e2d068d9ce9b=14842d22ee85f2ac4e90aa05722bbee3; path=/; HttpOnly; Secure

Request headers:

POST /hawkular/metrics/metrics/stats/query HTTP/1.1
Host: metrics.starter-us-east-1.openshift.com
Connection: keep-alive
Content-Length: 150
Pragma: no-cache
Cache-Control: no-cache
Hawkular-Tenant: sspeiche1
Authorization: Bearer 
Origin: https://console.starter-us-east-1.openshift.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Content-Type: application/json
Accept: application/json
DNT: 1
Referer: https://console.starter-us-east-1.openshift.com/console/project/sspeiche1/overview
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
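
For anyone triaging similar responses: the errorMsg in the response above carries the consistency level and the replica counts reported by the Cassandra driver. A minimal Python sketch for pulling those values out is below. The function name and the regular expression are illustrative assumptions based only on the single message quoted in this report, not on a documented message format.

```python
import json
import re

# Pattern inferred from the one error message quoted in this report (assumption).
_UNAVAILABLE = re.compile(
    r"consistency (?P<level>\w+) "
    r"\((?P<required>\d+) required but only (?P<alive>\d+) alive\)"
)


def parse_unavailable(response_body: str):
    """Return (level, required, alive), or None if the message does not match."""
    error_msg = json.loads(response_body).get("errorMsg", "")
    match = _UNAVAILABLE.search(error_msg)
    if match is None:
        return None
    return match["level"], int(match["required"]), int(match["alive"])
```

An "alive" count of 0 against a "required" count of 1 means the metrics service could not find any live Cassandra replica to answer the query.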
Comment 1 Steve Speicher 2017-06-09 15:38:41 EDT
Following up from IRC: it sounds like starter-us-east-1 is suffering from both AWS API rate limiting and etcd pressure.
Comment 2 Matt Wringe 2017-06-20 14:03:03 EDT
I can't seem to access starter-us-east-1 (I always get redirected to starter-us-east-2). Is this still an issue? Or was it resolved when the other issues in the cluster were fixed?
Comment 4 Steve Speicher 2017-07-18 21:26:09 EDT
Now that it has been upgraded to 3.6, I am simply getting a 503 error when hitting https://metrics.starter-us-east-1.openshift.com/hawkular/metrics
Comment 5 Matt Wringe 2017-07-19 11:41:48 EDT
The 503 error you are seeing for https://metrics.starter-us-east-1.openshift.com/hawkular/metrics is coming from the router, not Hawkular Metrics. This either means the route is not properly configured or the Hawkular Metrics pod is not in the running state.

Can you please check the state of the metrics pods in this cluster?
Comment 6 Eric Paris 2017-07-21 12:19:50 EDT
The problem is not the router. It is the pods. I have no idea what is wrong with the pods:

# oc get pod -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-j5v0k   1/1       Running   0          2d
hawkular-cassandra-2-khd50   1/1       Running   0          2d
hawkular-metrics-rcwjh       0/1       Running   423        2d
heapster-0w9r6               0/1       Running   441        2d

# oc logs -n openshift-infra heapster-0w9r6 | tail
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.
Could not connect to https://hawkular-metrics:443/hawkular/metrics/status. Curl exit code: 7. Status Code 000
'https://hawkular-metrics:443/hawkular/metrics/status' is not accessible [HTTP status code: 000. Curl exit code 7]. Retrying.

I have no idea how to debug heapster itself.
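
As a rough aid for reading listings like the one above, here is a small Python sketch that parses the five-column `oc get pod` output (NAME, READY, STATUS, RESTARTS, AGE) and flags pods that are not fully ready or have restarted often. The names `PodStatus`, `parse_pod_table`, and `unhealthy`, and the `max_restarts` threshold of 5, are assumptions made for illustration. It only handles the plain column layout shown in this comment.

```python
from typing import List, NamedTuple


class PodStatus(NamedTuple):
    name: str
    ready: int      # containers currently ready
    total: int      # containers expected
    status: str
    restarts: int


def parse_pod_table(text: str) -> List[PodStatus]:
    """Parse `oc get pod` output with columns NAME READY STATUS RESTARTS AGE."""
    pods = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, ready, status, restarts, _age = line.split()
        ready_count, total = ready.split("/")
        pods.append(
            PodStatus(name, int(ready_count), int(total), status, int(restarts))
        )
    return pods


def unhealthy(pods: List[PodStatus], max_restarts: int = 5) -> List[PodStatus]:
    """Pods that are not fully ready or restarted more than max_restarts times."""
    return [p for p in pods if p.ready < p.total or p.restarts > max_restarts]
```

Applied to the listing above, this flags hawkular-metrics-rcwjh and heapster-0w9r6, both of which report Running while never becoming ready.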
Comment 7 Eric Paris 2017-07-21 12:20 EDT
Created attachment 1302495 [details]
oc get pod heapster -o json
Comment 8 Steve Speicher 2017-07-24 20:39 EDT
Created attachment 1303937 [details]
console empty metrics graph
Comment 11 Matt Wringe 2017-07-25 10:04:07 EDT
(In reply to Eric Paris from comment #6)
> The problem is not the router. It is the pods. I have no idea what is wrong
> with the pods:
> 
> # oc get pod -n openshift-infra
> NAME                         READY     STATUS    RESTARTS   AGE
> hawkular-cassandra-1-j5v0k   1/1       Running   0          2d
> hawkular-cassandra-2-khd50   1/1       Running   0          2d
> hawkular-metrics-rcwjh       0/1       Running   423        2d
> heapster-0w9r6               0/1       Running   441        2d
> 

Who is monitoring this cluster? Obviously if hawkular-metrics has failed 423 times over the past 2 days something is really wrong. Why has it been restarting so many times?

> 
> I have no idea how to debug heapster itself.

Heapster is not the problem; it won't start until Hawkular Metrics has started. So we need to fix Hawkular Metrics first.
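
To illustrate the dependency: the Heapster log in comment 6 shows it repeatedly probing the Hawkular Metrics status endpoint and retrying. The Python sketch below shows that general gating pattern only. It is not Heapster's actual startup script, and `wait_until_ready`, its parameters, and the injected `probe` callable are hypothetical names.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next probe
    return False
```

As long as the probe against Hawkular Metrics keeps failing, a dependent started this way never proceeds, which is consistent with Heapster sitting at 0/1 in the listing.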
Comment 12 Robert Baumgartner 2017-08-10 04:06:27 EDT
I have the same issue on https://console.starter-us-west-2.openshift.com/console: no metrics data.
The log shows a 204 (no data) on the POST to https://metrics.starter-us-west-2.openshift.com/hawkular/metrics/metrics/stats/query with Body: {"tags":"descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:d513a826-7d3f-11e7-8490-0a69cdf75e6f","bucketDuration":"1mn","start":"-15mn"}
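
To reproduce the query outside the console, a request of the same shape can be built as in the Python sketch below. The body fields come from the log line above, and the headers come from the request headers in the original description. The function name `build_stats_query` and its parameters are illustrative assumptions, and the tenant and token are placeholders that must be supplied by the caller. The sketch only constructs the request and does not send it.

```python
import json
import urllib.request


def build_stats_query(metrics_url, tenant, token, pod_id, descriptors,
                      bucket_duration="1mn", start="-15mn"):
    """Build (but do not send) a POST to <metrics_url>/metrics/stats/query."""
    tags = "descriptor_name:{},type:pod,pod_id:{}".format(
        "|".join(descriptors), pod_id
    )
    body = {"tags": tags, "bucketDuration": bucket_duration, "start": start}
    return urllib.request.Request(
        url=metrics_url.rstrip("/") + "/metrics/stats/query",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Hawkular-Tenant": tenant,          # placeholder tenant
            "Authorization": "Bearer " + token,  # placeholder token
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a 204 status on the response corresponds to the "no data" case described here.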
Comment 13 Matt Wringe 2017-08-10 09:56:47 EDT
(In reply to Robert Baumgartner from comment #12)
> I have same issue on
> https://console.starter-us-west-2.openshift.com/console, no metrics data...
> The log shows a 204(no data) on the POST to
> https://metrics.starter-us-west-2.openshift.com/hawkular/metrics/metrics/
> stats/query with Body:
> {"tags":"descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:
> d513a826-7d3f-11e7-8490-0a69cdf75e6f","bucketDuration":"1mn","start":"-15mn"}

Metrics failing to show up in the console is a very basic error condition that can result from any number of problems. Can you please open a new issue so that we can properly track it there? Otherwise we end up with one bugzilla covering multiple issues, and it's difficult to keep track of what is happening.
