Description of problem:
Deploy metrics with the latest images available at the time of this report. The hawkular-metrics pod fails its liveness check and cannot start.

metrics-hawkular-metrics/images/v3.7.37-1
metrics-cassandra/images/v3.7.36-1
metrics-heapster/images/v3.7.36-1

Note: Retrying with metrics-hawkular-metrics-v3.7.36-1 does not show this issue.

# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6wpjn   1/1       Running            0          16m
hawkular-metrics-6xxrs       0/1       CrashLoopBackOff   8          16m
heapster-sjzt5               0/1       Running            1          16m

# oc describe po hawkular-metrics-6xxrs
***************************************snipped**********************************
16m 15m 4 kubelet, 172.16.120.80 spec.containers{hawkular-metrics} Warning Unhealthy Liveness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>.
Traceback (most recent call last):
  File "/opt/hawkular/scripts/hawkular-metrics-liveness.py", line 48, in <module>
    if int(uptime) < int(timeout):
ValueError: invalid literal for int() with base 10: ''
16m 15m 4 kubelet, 172.16.120.80 spec.containers{hawkular-metrics} Warning Unhealthy Readiness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>. This may be due to Hawkular Metrics not being ready yet. Will try again.
15m 15m 3 kubelet, 172.16.120.80 spec.containers{hawkular-metrics} Normal Pulled Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics:v3.7" already present on machine
15m 1m 64 kubelet, 172.16.120.80 spec.containers{hawkular-metrics} Warning BackOff Back-off restarting failed container
***************************************snipped**********************************

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy metrics 3.7 via ansible
2.
3.
Actual results:
The hawkular-metrics pod fails its liveness check and cannot start.

Expected results:
All pods should be healthy.

Additional info:
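The traceback in the description shows the liveness script calling int() on an empty uptime string, which raises ValueError in addition to the connection failure. The following is a minimal sketch of a defensive parse, not the contents of hawkular-metrics-liveness.py: the helper names parse_uptime and within_startup_grace are hypothetical, and the assumption that an uptime below the timeout means "still starting" is inferred only from the comparison visible in the traceback.

```python
def parse_uptime(raw):
    """Return the uptime as an int, or None when it is missing or not numeric."""
    try:
        return int(str(raw).strip())
    except (TypeError, ValueError):
        return None


def within_startup_grace(raw_uptime, timeout):
    """Hypothetical helper: True when the process is still inside its startup window.

    Assumption: uptime and timeout use the same unit. An unreadable uptime
    (for example '' as in the traceback) returns False instead of raising.
    """
    uptime = parse_uptime(raw_uptime)
    if uptime is None:
        return False
    return uptime < int(timeout)
```

With this shape, an empty uptime produces a clean probe failure rather than a stack trace in the pod events.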
Blocks metrics installation and testing of other features.
Please provide logs, the output of `oc get pods -o yaml`, and `oc get pods --all-namespaces | wc -l`. A very common cause of the liveness probe failing is heap pressure. GC logs are written to /opt/eap/standalone/log. You can try to capture any GC log files with `oc cp <hawkular-metrics-pod>:/opt/eap/standalone/log hawkular-metrics-log`. That directory is lost on container restart, so you may or may not be able to get the GC log files.
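Because the log directory is lost on restart, it can help to identify the failing pod quickly and have the copy command ready. The sketch below is only an illustration: unhealthy_pods and gc_log_copy_command are hypothetical helpers, and the parser assumes the single-namespace column layout of `oc get po` shown in this report (NAME, READY, STATUS, RESTARTS, AGE), not the `--all-namespaces` layout.

```python
def unhealthy_pods(table_text):
    """Return names of pods that are not fully ready or not in Running status."""
    names = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full pod row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_want = ready.partition("/")
        if status != "Running" or ready_now != ready_want:
            names.append(name)
    return names


def gc_log_copy_command(pod, dest="hawkular-metrics-log"):
    """Build the argument list for the oc cp command quoted in the comment above."""
    return ["oc", "cp", f"{pod}:/opt/eap/standalone/log", dest]


sample = """
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6wpjn   1/1       Running            0          16m
hawkular-metrics-6xxrs       0/1       CrashLoopBackOff   8          16m
heapster-sjzt5               0/1       Running            1          16m
"""
failing = unhealthy_pods(sample)
```

The argument list can be passed to subprocess.run on a host where `oc` is logged in to the cluster. Note that this simple check also flags pods in Completed status.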
Created attachment 1404375 [details] metrics pods log
Tested with metrics-hawkular-metrics-v3.7.36-2; the issue does not occur.

Images:
metrics-cassandra-v3.7.37-1
metrics-hawkular-metrics-v3.7.36-2
metrics-heapster-v3.7.37-1

# openshift version
openshift v3.7.36
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

# oc get po -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-vql6d   1/1       Running   0          27m
hawkular-metrics-lgt4m       1/1       Running   0          27m
heapster-l6z7c               1/1       Running   0          27m
Tested with metrics-hawkular-metrics-v3.7.42-2; the issue does not occur.

Images:
metrics-hawkular-metrics/images/v3.7.42-2
metrics-cassandra/images/v3.7.42-2
metrics-heapster/images/v3.7.42-2
Team, can we have an update on this? A customer is facing this issue. Let us know if you need more information from the customer's end.

Thanks,
Giriraj Rajawat
Joel, did updating the image resolve the problem?
I am resetting the version to 3.7 since that is the version for which the problem was reported. Giriraj, can you please open a separate ticket (or clone this one)? Thanks.
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days