Created attachment 1891363 [details]
inflight requests over time, showing the pod falling over after ~150 minutes until being restarted

Description of problem:

Related to https://bugzilla.redhat.com/show_bug.cgi?id=2091902, an architecture change in how prometheus-adapter serves metrics has increased latency, causing inflight requests to pile up until prometheus-adapter eventually fails. However, because the pod has no LivenessProbe, it always reports as Ready. Deleting the pod / rolling the deployment lets the metrics server run successfully again for a while.

OpenShift/Cluster Monitoring Operator version: 4.10.15

Expected Behavior:

The prometheus-adapter pod is killed and restarted when it is not actually Running.

Additional Info:

There are fixes to the "root cause" being discussed in https://bugzilla.redhat.com/show_bug.cgi?id=2091902, but I believe the LivenessProbe change (https://github.com/openshift/cluster-monitoring-operator/pull/1621) should also be backported so that the pod restarts on its own and does not need manual intervention while the fallout from the added latency is resolved.
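For reference, a minimal sketch of what such a liveness probe could look like, expressed with the Kubernetes Go API types that the cluster-monitoring-operator ecosystem uses. This is not the content of PR 1621; the /livez path, the "https" port name, and the timing values are assumptions for illustration only.

// liveness_probe_sketch.go: hypothetical liveness probe for the
// prometheus-adapter container. All concrete values are assumptions;
// the real configuration is in openshift/cluster-monitoring-operator PR 1621.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// prometheusAdapterLivenessProbe builds a probe that hits the adapter's
// (assumed) /livez endpoint over HTTPS and restarts the container after
// several consecutive failures, instead of leaving a wedged pod marked Ready.
func prometheusAdapterLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path:   "/livez",                    // assumed endpoint
				Port:   intstr.FromString("https"),  // assumed named port
				Scheme: corev1.URISchemeHTTPS,
			},
		},
		InitialDelaySeconds: 30, // assumed timings
		PeriodSeconds:       10,
		FailureThreshold:    5,
	}
}

func main() {
	fmt.Printf("%+v\n", prometheusAdapterLivenessProbe())
}

With a probe like this attached to the prometheus-adapter container spec, the kubelet would restart the container when the endpoint stops responding, rather than requiring someone to delete the pod by hand.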
prometheus-adapter runs in the openshift-monitoring namespace, so customers are not responsible for restarting workloads in the namespace. This bug is requiring Red Hat on-call engineers to periodically restart pods manually across the fleet.
Closing as a duplicate of bug 2048333, but don't worry, we'll proceed with the backport. I'm not sure that the readiness probe will fail when prometheus-adapter hits the max concurrent requests limit, but the fix wouldn't hurt anyway.

*** This bug has been marked as a duplicate of bug 2048333 ***
Bug 2099526 tracks the 4.10.z backport.