Bug 1551474

Summary: [3.7]hawkular metrics pod failed at liveness check, pod can not be started up
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.7.0
Status: CLOSED DEFERRED
Severity: high
Priority: high
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Keywords: Regression
Type: Bug
Reporter: Junqi Zhao <juzhao>
Assignee: Ruben Vargas Palma <rvargasp>
QA Contact: Junqi Zhao <juzhao>
CC: aos-bugs, jlee, jrosenta, juzhao, rvargasp, suchaudh
Last Closed: 2019-11-20 18:49:57 UTC
Bug Blocks: 1567827, 1613130
Attachments: metrics pods log (flags: none)

Description Junqi Zhao 2018-03-05 09:15:46 UTC
Description of problem:
Deploying metrics with the current latest images, the hawkular-metrics pod fails its liveness check and cannot start up.
metrics-hawkular-metrics/images/v3.7.37-1
metrics-cassandra/images/v3.7.36-1
metrics-heapster/images/v3.7.36-1

Note: Retrying with metrics-hawkular-metrics-v3.7.36-1 does not reproduce this issue.

# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-6wpjn   1/1       Running            0          16m
hawkular-metrics-6xxrs       0/1       CrashLoopBackOff   8          16m
heapster-sjzt5               0/1       Running            1          16m

# oc describe po hawkular-metrics-6xxrs
***************************************snipped**********************************
  16m		15m		4	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning		Unhealthy		Liveness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>.
Traceback (most recent call last):
  File "/opt/hawkular/scripts/hawkular-metrics-liveness.py", line 48, in <module>
    if int(uptime) < int(timeout):
ValueError: invalid literal for int() with base 10: ''

  16m	15m	4	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning	Unhealthy	Readiness probe failed: Failed to access the status endpoint : <urlopen error [Errno 111] Connection refused>. This may be due to Hawkular Metrics not being ready yet. Will try again.

  15m	15m	3	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Normal	Pulled	Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics:v3.7" already present on machine
  15m	1m	64	kubelet, 172.16.120.80	spec.containers{hawkular-metrics}	Warning	BackOff	Back-off restarting failed container
***************************************snipped**********************************
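
The traceback above shows the liveness script calling int() on an empty uptime string, so the probe fails with a ValueError instead of a clean result. A minimal sketch of a more defensive comparison follows. The helper names parse_seconds and within_startup_grace are hypothetical and are not taken from hawkular-metrics-liveness.py. How the real script obtains uptime and timeout is not shown in this report, so the inputs here are plain strings, and treating an unreadable value as "not within the grace period" is an assumption, not the script's documented behavior.

```python
def parse_seconds(raw):
    """Return raw as a non-negative int, or None if it is empty or not numeric."""
    try:
        value = int(str(raw).strip())
    except (TypeError, ValueError):
        # Covers '' (the case in the traceback), None, and non-numeric text.
        return None
    return value if value >= 0 else None


def within_startup_grace(uptime_raw, timeout_raw):
    """True only when both values parse and uptime is below the timeout.

    Assumption: if either value cannot be read, report False so the caller
    can decide how to handle an unknown uptime instead of crashing.
    """
    uptime = parse_seconds(uptime_raw)
    timeout = parse_seconds(timeout_raw)
    if uptime is None or timeout is None:
        return False
    return uptime < timeout
```

With this shape, an empty uptime value yields False rather than raising, and the caller can log the condition before choosing an exit code.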


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy metrics 3.7 via ansible

Actual results:
The hawkular-metrics pod fails its liveness check and cannot start up.

Expected results:
All pods should be healthy

Additional info:

Comment 1 Junqi Zhao 2018-03-05 09:16:38 UTC
Blocks metrics installation and testing of other features.

Comment 2 John Sanda 2018-03-05 14:37:00 UTC
Please provide logs, the output of `oc get pods -o yaml`, and `oc get pods --all-namespaces | wc -l`.

A very common cause of the liveness probe failing is heap pressure. GC logs are written to /opt/eap/standalone/log. You can try to capture any GC log files with `oc cp <hawkular-metrics-pod>:/opt/eap/standalone/log hawkular-metrics-log`. That directory is lost on container restart, so you may or may not be able to get the GC log files.
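
Because the log directory disappears on container restart, it can help to script the copy so it runs as soon as the pod is up. The following is a rough sketch only: gc_log_copy_command and copy_gc_logs are hypothetical helper names, the default destination simply mirrors the command quoted above, and the optional namespace argument is an assumption about how the cluster is laid out.

```python
import subprocess

GC_LOG_DIR = "/opt/eap/standalone/log"  # path quoted in the comment above


def gc_log_copy_command(pod, dest="hawkular-metrics-log", namespace=None):
    """Build the `oc cp` argument list for copying the GC log directory."""
    cmd = ["oc", "cp", f"{pod}:{GC_LOG_DIR}", dest]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def copy_gc_logs(pod, dest="hawkular-metrics-log", namespace=None):
    """Run the copy and return the exit code; requires `oc` on PATH."""
    result = subprocess.run(gc_log_copy_command(pod, dest, namespace), check=False)
    return result.returncode
```

For example, copy_gc_logs("hawkular-metrics-6xxrs", namespace="openshift-infra") would attempt the copy for the pod named in the description. A nonzero return code may simply mean the container restarted first.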

Comment 4 Junqi Zhao 2018-03-05 15:46:45 UTC
Created attachment 1404375 [details]
metrics pods log

Comment 8 Junqi Zhao 2018-03-09 08:59:33 UTC
Tested with metrics-hawkular-metrics-v3.7.36-2; the issue does not occur.

Images:
metrics-cassandra-v3.7.37-1
metrics-hawkular-metrics-v3.7.36-2
metrics-heapster-v3.7.37-1


# openshift version
openshift v3.7.36
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8


# oc get po -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-vql6d   1/1       Running   0          27m
hawkular-metrics-lgt4m       1/1       Running   0          27m
heapster-l6z7c               1/1       Running   0          27m

Comment 9 Junqi Zhao 2018-04-03 07:44:33 UTC
Tested with metrics-hawkular-metrics-v3.7.42-2; the issue does not occur.

Images
metrics-hawkular-metrics/images/v3.7.42-2
metrics-cassandra/images/v3.7.42-2
metrics-heapster/images/v3.7.42-2

Comment 21 giriraj rajawat 2018-08-02 08:13:57 UTC
Team, can we have an update on this? A customer is facing the issue.
Let us know if you need more information from the customer's end.

Thanks,
Giriraj Rajawat

Comment 22 John Sanda 2018-08-06 19:31:36 UTC
Joel, did updating the image resolve the problem?

Comment 23 John Sanda 2018-08-06 19:34:00 UTC
I am resetting the version to 3.7 since that is the version for which the problem was reported.

Giriraj, can you please open a separate ticket (or clone this one)? Thanks.

Comment 27 Stephen Cuppett 2019-11-20 18:49:57 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

Comment 28 Red Hat Bugzilla 2023-09-15 00:06:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days