Bug 1816500

Summary:	Readiness and Liveness probes are failing for the application pods
Product:	OpenShift Container Platform	Reporter:	manisha <mdhanve>
Component:	Monitoring	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED DUPLICATE	QA Contact:	Junqi Zhao <juzhao>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	3.11.0	CC:	alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, wzheng
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-26 13:48:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description manisha 2020-03-24 07:04:33 UTC

Description of problem: The customer is facing an issue where readiness and liveness probes for application pods fail after resources are assigned based on the Grafana dashboard statistics. The Prometheus query behind the dashboard uses 'namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate', which is computed over a 5-minute range, so it does not reflect instantaneous CPU usage. The customer was therefore asked to check CPU and memory usage with 'oc adm top pods'. According to that output a pod uses 2m CPU, so the customer assigned 50-100m to this pod; however, the health check continues to fail for the pod. We checked that the issue is neither node-specific nor application-specific.

Also, when the CPU and memory limits are increased, the pods work fine, though the project events keep logging readiness and liveness probe failures.


Actual results: The application pod fails its readiness and liveness probes.

Comment 4 Pawel Krupa 2020-03-26 13:48:10 UTC

Before OpenShift 4.5, the data shown by `oc adm top pods` came from a query that was not instant; its output was smoothed over a 5m window. After the fix in https://bugzilla.redhat.com/show_bug.cgi?id=1812004 this is no longer the case.
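
To illustrate why a value smoothed over 5m can understate what a pod needs, here is a minimal Python sketch. It is not the actual Prometheus rule or the `oc adm top pods` implementation: it only averages a cumulative CPU counter between two endpoints, which is a simplification of how a range-based rate behaves. The 2m idle figure and the 5m window come from this report; the burst size, burst timing and the 10s comparison window are hypothetical values chosen for illustration.

```python
from itertools import accumulate

IDLE_CORES = 0.002                  # 2m CPU, the idle figure quoted in this report
BURST_CORES = 0.5                   # hypothetical short spike (500m)
BURST_START, BURST_END = 280, 290   # hypothetical 10 s burst inside the window
TRACE_SECONDS = 300                 # 5 minutes, the smoothing range from the report

# Per-second CPU usage in cores for the whole trace.
usage = [
    BURST_CORES if BURST_START <= t < BURST_END else IDLE_CORES
    for t in range(TRACE_SECONDS)
]

# Cumulative CPU seconds, analogous to a counter such as
# container_cpu_usage_seconds_total: counter[t] is the total consumed up to time t.
counter = [0.0] + list(accumulate(usage))


def avg_rate(counter, end, window):
    """Average cores over [end - window, end], from the counter's two endpoints."""
    start = end - window
    return (counter[end] - counter[start]) / window


five_min = avg_rate(counter, end=300, window=300)  # smoothed over the full 5 minutes
ten_sec = avg_rate(counter, end=290, window=10)    # short window covering the burst

print(f"5m average : {five_min * 1000:.1f}m")
print(f"10s average: {ten_sec * 1000:.1f}m")
```

Under these assumed numbers the 5m average comes out at roughly 19m, while the 10s window covering the burst shows 500m. A CPU assignment sized from the smoothed value could therefore sit well below the short-lived peak.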

The fix won't be backported, as it is not critical for cluster operations.

*** This bug has been marked as a duplicate of bug 1812004 ***