Bug 1551474 - [3.7]hawkular metrics pod failed at liveness check, pod can not be started up
Summary: [3.7]hawkular metrics pod failed at liveness check, pod can not be started up
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1567827 1613130
Reported: 2018-03-05 09:15 UTC by Junqi Zhao
Modified: 2023-09-15 00:06 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1567827 1613130 (view as bug list)
Environment:
Last Closed: 2019-11-20 18:49:57 UTC
Target Upstream Version:
Embargoed:


Attachments
metrics pods log (13.66 KB, application/x-gzip)
2018-03-05 15:46 UTC, Junqi Zhao

Description Junqi Zhao 2018-03-05 09:15:46 UTC
Description of problem:
Deployed metrics with the current latest images; the hawkular metrics pod fails the liveness check and cannot start.
metrics-hawkular-metrics/images/v3.7.37-1
metrics-cassandra/images/v3.7.36-1
metrics-heapster/images/v3.7.36-1

Note: Retried with metrics-hawkular-metrics-v3.7.36-1; that image does not have this issue.

# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6wpjn   1/1       Running            0          16m
hawkular-metrics-6xxrs       0/1       CrashLoopBackOff   8          16m
heapster-sjzt5               0/1       Running            1          16m
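
A listing like the one above can be checked in a script instead of by eye. The following is a minimal sketch, assuming the default `oc get po` column layout shown here (NAME, READY, STATUS as the first three columns). The function name `unready_pods` is hypothetical and not part of any OpenShift tooling.

```python
def unready_pods(table_text):
    """Return (name, ready, status) for each pod that is not fully ready.

    A pod is reported when its STATUS is not "Running" or when the two
    halves of its READY column (for example "0/1") differ.
    """
    results = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines, malformed rows, and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        have, _, want = ready.partition("/")
        if status != "Running" or have != want:
            results.append((name, ready, status))
    return results
```

Note that a pod can be in `Running` status and still be unready, as heapster is in the listing above, which is why the READY column is compared as well.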

# oc describe po hawkular-metrics-6xxrs
***************************************snipped**********************************
  16m		15m		4	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning		Unhealthy		Liveness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>.
Traceback (most recent call last):
  File "/opt/hawkular/scripts/hawkular-metrics-liveness.py", line 48, in <module>
    if int(uptime) < int(timeout):
ValueError: invalid literal for int() with base 10: ''

  16m	15m	4	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning	Unhealthy	Readiness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>. This may be due to Hawkular Metrics not being ready yet. Will try again.

  15m	15m	3	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Normal	Pulled	Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics:v3.7" already present on machine
  15m	1m	64	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning	BackOff	Back-off restarting failed container
***************************************snipped**********************************
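
The traceback above shows `int(uptime)` being called on an empty string, which raises `ValueError` in Python. The following is a minimal sketch of a defensive comparison, not the contents of `hawkular-metrics-liveness.py` and not the fix that was shipped. The helper names `parse_seconds` and `within_startup_grace` are hypothetical, and how the real script obtains `uptime` and `timeout` is not shown in this report.

```python
def parse_seconds(raw):
    """Return raw as an int, or None when it is empty or not an integer."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None


def within_startup_grace(uptime, timeout):
    """Compare uptime against timeout without raising on empty input.

    Returns True if uptime < timeout, False if not, and None when either
    value could not be parsed, so the caller can choose how to treat an
    unknown uptime instead of crashing with a traceback.
    """
    up = parse_seconds(uptime)
    limit = parse_seconds(timeout)
    if up is None or limit is None:
        return None
    return up < limit
```

With this shape, an empty `uptime` yields `None` rather than an exception; whether an unknown uptime should count as a probe failure is a policy decision left to the caller.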


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy metrics 3.7 via ansible
2.
3.

Actual results:
The hawkular metrics pod fails the liveness check and cannot start.

Expected results:
All pods should be healthy

Additional info:

Comment 1 Junqi Zhao 2018-03-05 09:16:38 UTC
This blocks metrics installation and testing of other features.

Comment 2 John Sanda 2018-03-05 14:37:00 UTC
Please provide logs, the output of `oc get pods -o yaml`, and `oc get pods --all-namespaces | wc -l`.

A very common cause of the liveness probe failing is heap pressure. GC logs are written to /opt/eap/standalone/log. You can try to capture any GC log files with `oc cp <hawkular-metrics-pod>:/opt/eap/standalone/log hawkular-metrics-log`. That directory is lost on container restart, so you may or may not be able to get the GC log files.

Comment 4 Junqi Zhao 2018-03-05 15:46:45 UTC
Created attachment 1404375
metrics pods log

Comment 8 Junqi Zhao 2018-03-09 08:59:33 UTC
Tested with metrics-hawkular-metrics-v3.7.36-2; the issue does not occur.

Images:
metrics-cassandra-v3.7.37-1
metrics-hawkular-metrics-v3.7.36-2
metrics-heapster-v3.7.37-1


# openshift version
openshift v3.7.36
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8


# oc get po -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-vql6d   1/1       Running   0          27m
hawkular-metrics-lgt4m       1/1       Running   0          27m
heapster-l6z7c               1/1       Running   0          27m

Comment 9 Junqi Zhao 2018-04-03 07:44:33 UTC
Tested with metrics-hawkular-metrics-v3.7.42-2; the issue does not occur.

Images
metrics-hawkular-metrics/images/v3.7.42-2
metrics-cassandra/images/v3.7.42-2
metrics-heapster/images/v3.7.42-2

Comment 21 giriraj rajawat 2018-08-02 08:13:57 UTC
Team, can we have an update on this? A customer is facing the issue.
Let us know if you need more information from the customer's end.

Thanks,
Giriraj Rajawat

Comment 22 John Sanda 2018-08-06 19:31:36 UTC
Joel, did updating the image resolve the problem?

Comment 23 John Sanda 2018-08-06 19:34:00 UTC
I am resetting the version to 3.7 since that is the version for which the problem was reported.

Giriraj, can you please open a separate ticket (or clone this one)? Thanks.

Comment 27 Stephen Cuppett 2019-11-20 18:49:57 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

Comment 28 Red Hat Bugzilla 2023-09-15 00:06:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.