Created attachment 1891363 [details]
inflight requests over time, showing the pod falling over after ~150 minutes until being restarted

Description of problem:

Related to https://bugzilla.redhat.com/show_bug.cgi?id=2091902, an architecture change in how prometheus-adapter serves metrics has increased latency, causing inflight requests to pile up until prometheus-adapter eventually fails. However, because the pod has no LivenessProbe, it always reports as Ready. Deleting the pod / rolling the deployment lets the metrics server run successfully again for a while.

OpenShift/Cluster Monitoring Operator version: 4.10.15

Expected Behavior:

The prometheus-adapter pod is killed and restarted when it is not actually Running.

Additional Info:

There are fixes to the "root cause" being discussed in https://bugzilla.redhat.com/show_bug.cgi?id=2091902, but I believe the LivenessProbe change (https://github.com/openshift/cluster-monitoring-operator/pull/1621) should also be backported so that the pod restarts on its own and does not need manual intervention while the fallout from the added latency is resolved.
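For reference, a minimal sketch of what such a liveness probe could look like, expressed with the Kubernetes Go API types that the cluster-monitoring-operator ecosystem uses. This is not the content of PR 1621; the /livez path, the "https" port name, and the timing values are assumptions for illustration only.

// liveness_probe_sketch.go: hypothetical liveness probe for the
// prometheus-adapter container. All concrete values are assumptions;
// the real configuration is in openshift/cluster-monitoring-operator PR 1621.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// prometheusAdapterLivenessProbe builds a probe that hits the adapter's
// (assumed) /livez endpoint over HTTPS and restarts the container after
// several consecutive failures, instead of leaving a wedged pod marked Ready.
func prometheusAdapterLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path:   "/livez",                    // assumed endpoint
				Port:   intstr.FromString("https"),  // assumed named port
				Scheme: corev1.URISchemeHTTPS,
			},
		},
		InitialDelaySeconds: 30, // assumed timings
		PeriodSeconds:       10,
		FailureThreshold:    5,
	}
}

func main() {
	fmt.Printf("%+v\n", prometheusAdapterLivenessProbe())
}

With a probe like this attached to the prometheus-adapter container spec, the kubelet would restart the container when the endpoint stops responding, rather than requiring someone to delete the pod by hand.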
prometheus-adapter runs in the openshift-monitoring namespace, so customers are not responsible for restarting workloads in the namespace. This bug is requiring Red Hat on-call engineers to periodically restart pods manually across the fleet.
Closing as a duplicate of bug 2048333, but don't worry, we'll proceed with the backport. I'm not sure that the readiness probe will fail when prometheus-adapter hits the max concurrent requests limit, but the fix wouldn't hurt anyway.

*** This bug has been marked as a duplicate of bug 2048333 ***
Bug 2099526 tracks the 4.10.z backport.