Description of problem:
All three pods in the openshift-infra project are running, but pod metrics are not visible in the UI. From the CLI, metrics are available when checked with the oc adm top command.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install metrics in OCP
2. Check whether the liveness and readiness probes are failing and whether the metrics are visible in the UI
Actual results:
Probes are failing and metrics are not available in the UI
Expected results:
Probes should not fail and metrics should be available in the UI
The project events are showing that readiness and liveness probes are failing but the scripts for liveness and readiness probes inside the pods seem to be executed successfully.
The following events were captured from the openshift-infra project:
11:18:00 AM hawkular-metrics-xr4g8 Pod Warning Unhealthy Readiness probe failed: The MetricService is not yet in the STARTED state [STARTING]. We need to wait until its in the STARTED state.
11:17:45 AM hawkular-metrics-xr4g8 Pod Warning Unhealthy Readiness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>. This may be due to Hawkular Metrics not being ready yet. Will try again.
2 times in the last 4 minutes
11:17:38 AM hawkular-cassandra-1-4ml4c Pod Warning Unhealthy Readiness probe failed: Could not get the Cassandra status. This may mean that the Cassandra instance is not up yet. Will try again nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
11:17:38 AM hawkular-metrics-xr4g8 Pod Warning Unhealthy Liveness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>. Traceback (most recent call last): File "/opt/hawkular/scripts/hawkular-metrics-liveness.py", line 48, in <module> if int(uptime) < int(timeout): ValueError: invalid literal for int() with base 10: ''
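The traceback in the last event shows int() being called on an empty string, which suggests the liveness script had no uptime value to work with because the status endpoint was not reachable yet. The following is only a minimal sketch of how such a comparison could tolerate a missing value. The helper names parse_seconds and within_startup_window are hypothetical and are not taken from hawkular-metrics-liveness.py, whose real logic is not shown in this report.

```python
def parse_seconds(value):
    """Return value as an int, or None when it is empty or not numeric."""
    try:
        return int(str(value).strip())
    except ValueError:
        # int('') raises ValueError, which is the failure seen in the traceback
        return None


def within_startup_window(uptime, timeout):
    """Return True/False for uptime < timeout, or None when either is unknown."""
    up = parse_seconds(uptime)
    limit = parse_seconds(timeout)
    if up is None or limit is None:
        return None  # the caller decides how to treat an unknown uptime
    return up < limit
```

Returning None instead of raising lets the caller report "uptime not available yet" rather than failing with a traceback; whether an unknown uptime should count as a probe failure is a decision this sketch leaves open.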
Are we sure the hawkular-metrics and hawkular-cassandra pods did not transition to successfully running later? These probe failures are normal during pod startup, and once the pods have started the warnings should stop occurring. I see these are all within 22 seconds; it would only be a concern if they continued to appear for more than a few minutes. Does "oc get pod -n openshift-infra" reveal that the pods are in Ready state?
Are we sure the hawkular-metrics and hawkular-cassandra pods did not transition to successfully running later?
-- Those were already running (probes are failing after successful pod start.)
Does "oc get pod -n openshift-infra" reveal that the pods are in Ready state?
-- Yes, pods from the openshift-infra project are in running state.
Then I don't understand - if the probes are failing after successful pod start, how can the pods be in running state? Do you mean that the readiness probe is failing, so they are Running, but not Ready? Could they share the output of "oc get pods -n openshift-infra" and perhaps "oc get events -n openshift-infra" so I can see what the exact status of the pods is? I need to see if the pods are getting restarted periodically, which would happen when a probe reaches its failure threshold.
Below are the results. oc get events is empty at the moment.
Here is the output:
# oc get pods -n openshift-infra
NAME                            READY   STATUS      RESTARTS   AGE
hawkular-cassandra-1-klx5p      1/1     Running     0          13d
hawkular-metrics-pkfz2          1/1     Running     0          13d
hawkular-metrics-schema-n7n8l   0/1     Completed   0          24d
heapster-b6pds                  1/1     Running     0          17d
# oc get events -n openshift-infra
No resources found.
From the "oc get pods" and "oc get events" output it is apparent that the pods are running fine. If probes were failing, the pods would not be Running and Ready. There aren't any restarts either.
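The check described here (Running, Ready, no restarts) can be expressed as a small script over the "oc get pods" table. This is only a sketch: the function name find_unhealthy_pods is hypothetical, it assumes the default NAME/READY/STATUS/RESTARTS/AGE column order shown in this report, and it treats a Completed pod such as the schema job as healthy.

```python
def find_unhealthy_pods(table_text):
    """Return (pod name, reason) pairs for rows that look unhealthy."""
    problems = []
    rows = [line for line in table_text.splitlines() if line.strip()]
    for row in rows[1:]:  # rows[0] is assumed to be the column header
        fields = row.split()
        if len(fields) < 4:
            continue  # not a pod row
        name, ready, status, restarts = fields[:4]
        if status == "Completed":
            continue  # finished jobs are expected to show 0/1
        if status != "Running":
            problems.append((name, "status is " + status))
            continue
        ready_count, _, total = ready.partition("/")
        if ready_count != total:
            problems.append((name, "not Ready (" + ready + ")"))
        elif restarts.isdigit() and int(restarts) > 0:
            problems.append((name, restarts + " restarts"))
    return problems
```

Fed the four rows posted in this report, the function finds nothing to flag, which matches the reading that the probes are not currently failing.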
So if they are having trouble seeing the metrics in the console, could it be because the console is misconfigured? Is the console showing an error message such as "An error occurred getting metrics."? This happens when there is a value for metricsPublicURL but the metrics can't be found there. Do they have a metrics URL configured in the web console config?
oc get openshiftwebconsoleconfig instance -o json -n openshift-web-console
In the resulting JSON document, there should be a value at the path /spec/config/clusterInfo/metricsPublicURL, and that value should point at the hawkular-metrics route. The host of the route can be found by running "oc get route hawkular-metrics -n openshift-infra" (and https:// must be prepended to it).
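The comparison between the configured URL and the route can be sketched as follows. The function name metrics_url_matches_route is hypothetical, and the sketch assumes only what is stated above: the JSON path /spec/config/clusterInfo/metricsPublicURL, an https:// scheme, and a host equal to the route's host. It does not validate anything after the host in the URL.

```python
import json
from urllib.parse import urlsplit


def metrics_url_matches_route(config_json, route_host):
    """Check that metricsPublicURL is an https URL on the given route host."""
    node = json.loads(config_json)
    # Walk the path /spec/config/clusterInfo/metricsPublicURL key by key
    for key in ("spec", "config", "clusterInfo", "metricsPublicURL"):
        if not isinstance(node, dict) or key not in node:
            return False  # the value is missing from the config
        node = node[key]
    if not isinstance(node, str):
        return False
    url = urlsplit(node)
    return url.scheme == "https" and url.hostname == route_host.lower()
```

The first argument would be the output of the "oc get openshiftwebconsoleconfig" command above, and the second the host reported by "oc get route hawkular-metrics -n openshift-infra".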
Another possible explanation for the error might be that Heapster is not collecting metrics about pods for some reason. Can they check the Heapster logs for errors? Heapster is a pod running in the openshift-infra namespace.
# rpm -qa | grep openshift-ansible
When openshift_metrics_heapster_standalone is set to false with a lowercase f, the installation is successful.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.