Description of problem:

With OpenShift Container Platform 4.8.2, https://bugzilla.redhat.com/show_bug.cgi?id=1948711 was introduced, which set the number of replicas for prometheus-adapter to 2. The problem is that `maxSurge` is set to 25% and `maxUnavailable` to 1. With that, additional capacity is provisioned during the rollout, but due to the anti-affinity rules, FailedScheduling can occur depending on the number of OpenShift Container Platform - Node(s) available.

For example, when creating 3 OpenShift Container Platform - Infra Node(s) (following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html) and enforcing the Cluster Monitoring stack to run on those 3 OpenShift Container Platform - Infra Node(s) (following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html#infrastructure-moving-monitoring_creating-infrastructure-machinesets), we'll see FailedScheduling events during the rollout.

> ip-10-0-144-148.ha-fooo-X.compute.internal   Ready   worker         57m   v1.22.0-rc.0+a44d0f0   192.168.144.148   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-145-68.ha-fooo-X.compute.internal    Ready   worker         58m   v1.22.0-rc.0+a44d0f0   192.168.145.68    <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-149-190.ha-fooo-X.compute.internal   Ready   infra,worker   29m   v1.22.0-rc.0+a44d0f0   192.168.149.190   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-158-199.ha-fooo-X.compute.internal   Ready   master         64m   v1.22.0-rc.0+a44d0f0   192.168.158.199   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-185-177.ha-fooo-X.compute.internal   Ready   master         65m   v1.22.0-rc.0+a44d0f0   192.168.185.177   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-200-94.ha-fooo-X.compute.internal    Ready   infra,worker   29m   v1.22.0-rc.0+a44d0f0   192.168.200.94    <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-245-246.ha-fooo-X.compute.internal   Ready   master         65m   v1.22.0-rc.0+a44d0f0   192.168.245.246   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-247-191.ha-fooo-X.compute.internal   Ready   infra,worker   20m   v1.22.0-rc.0+a44d0f0   192.168.247.191   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-250-250.ha-fooo-X.compute.internal   Ready   worker         58m   v1.22.0-rc.0+a44d0f0   192.168.250.250   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8

> Wed Nov 17 11:40:10 AM CET 2021
> prometheus-adapter-586b986d49-4csql   0/1   Pending             0   1s    <none>       <none>                                       <none>   <none>
> prometheus-adapter-586b986d49-q48z7   0/1   ContainerCreating   0   1s    <none>       ip-10-0-247-191.ha-fooo-X.compute.internal   <none>   <none>
> prometheus-adapter-7dcdbb4d87-4hlpp   1/1   Terminating         0   10m   10.130.2.6   ip-10-0-149-190.ha-fooo-X.compute.internal   <none>   <none>
> prometheus-adapter-7dcdbb4d87-fb9lj   1/1   Running             0   10m   10.131.2.7   ip-10-0-200-94.ha-fooo-X.compute.internal    <none>   <none>

> LAST SEEN   TYPE      REASON              OBJECT                                     MESSAGE
> 1s          Warning   FailedScheduling    pod/prometheus-adapter-586b986d49-4csql    0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
> 1s          Normal    Scheduled           pod/prometheus-adapter-586b986d49-q48z7    Successfully assigned openshift-monitoring/prometheus-adapter-586b986d49-q48z7 to ip-10-0-247-191.ha-fooo-X.compute.internal
> 1s          Normal    AddedInterface      pod/prometheus-adapter-586b986d49-q48z7    Add eth0 [10.128.4.8/23] from openshift-sdn
> 0s          Normal    Pulling             pod/prometheus-adapter-586b986d49-q48z7    Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0be6fb325dd16975b2ace01362ebcaf0cc97075f1641c377f4580dea72941d64"
> 2s          Normal    SuccessfulCreate    replicaset/prometheus-adapter-586b986d49   Created pod: prometheus-adapter-586b986d49-q48z7
> 2s          Normal    SuccessfulCreate    replicaset/prometheus-adapter-586b986d49   Created pod: prometheus-adapter-586b986d49-4csql
> 2s          Normal    Killing             pod/prometheus-adapter-7dcdbb4d87-4hlpp    Stopping container prometheus-adapter
> 2s          Normal    SuccessfulDelete    replicaset/prometheus-adapter-7dcdbb4d87   Deleted pod: prometheus-adapter-7dcdbb4d87-4hlpp
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled up replica set prometheus-adapter-586b986d49 to 1
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled down replica set prometheus-adapter-7dcdbb4d87 to 1
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled up replica set prometheus-adapter-586b986d49 to 2

Above we can see that "0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate" is reported. It's only transient and will reconcile quickly, but it's still something that can trigger unneeded activity/events or even alerts.

I therefore recommend setting `maxSurge` to 0 while keeping `maxUnavailable` at 1. This prevents the transient event from being reported because no additional capacity is provisioned during the rollout, so even environments with only a small amount of spare capacity should be fine and not see any problematic event.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.9.5

How reproducible:
- Always

Steps to Reproduce:
1. Set up 3 OpenShift Container Platform 4 - Infra Node(s) following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html
2. Assign Cluster Monitoring to these 3 OpenShift Container Platform 4 - Infra Node(s) following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html#infrastructure-moving-monitoring_creating-infrastructure-machinesets
3. Trigger a rollout of the `prometheus-adapter` deployment (an example command is shown under Additional info below)

Actual results:
"0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate" is reported as an event because one pod stays in the Pending state for a short period of time: no suitable OpenShift Container Platform 4 - Infra Node can be found due to the constraints (anti-affinity) put in place by the deployment.

Expected results:
During a regular rollout of the `prometheus-adapter` deployment there should not be any FailedScheduling event reported; the deployment should be configured in a way that works in all possible scenarios.

Additional info:
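One possible way to trigger the rollout in step 3 manually (the cluster-monitoring-operator will also roll the deployment on its own, e.g. during upgrades):

    $ oc -n openshift-monitoring rollout restart deployment/prometheus-adapter

For illustration only, a minimal sketch of how the suggested change would look in the deployment's rollout strategy; apart from the strategy itself, the field values are illustrative and not copied from the cluster:

    # prometheus-adapter Deployment excerpt (illustrative sketch)
    spec:
      replicas: 2
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 0        # suggested: do not create a surge pod during the rollout
          maxUnavailable: 1  # keep one replica serving while the other is replaced

With `maxSurge: 0` no third pod is created during a normal rollout, so the anti-affinity rule cannot run out of eligible nodes on a 3-node infra pool.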
Thanks for the detailed report! We followed the OCP guidelines about high availability [1], so we'll need to investigate further with the workload team whether we did something wrong or whether the guidelines should be amended. Decreasing the severity to low since the failed scheduling error is transient and resolves by itself.

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability
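For context: the FailedScheduling event above ("didn't match pod anti-affinity rules") indicates a hard (required) pod anti-affinity constraint. A hypothetical sketch of what such a constraint looks like in a pod template; the label values are illustrative and not copied from the actual deployment:

    # Illustrative pod template excerpt (not copied from the cluster): with a
    # required anti-affinity rule on the hostname topology key, a surged third
    # pod has no eligible node left once every infra node already runs a
    # prometheus-adapter pod (including one that is still terminating).
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-adapter   # illustrative label
          topologyKey: kubernetes.io/hostname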
Following up on our previous conversation. As the events are a minor inconvenience, I would leave `maxSurge` at the default of 25%. It is true that 0 would lower the number of events, but it would not prevent them entirely in case replicas == number of nodes. Also, I am planning to propose a feature upstream that would support this scenario.
Closing this as deferred based on Filip's comment about the upstream proposal. From the monitoring perspective, we have followed the best practices as per https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability. The error is transient and the issue resolves on its own.
(In reply to Filip Krepinsky from comment #3)
> Also, I am planning to propose a feature upstream that would support this scenario.

Do you have details on where this effort is tracked, so we can stay updated on the upstream progress?
There is no upstream tracking yet; I am planning to follow up on this in January.
I have started an upstream thread about a feature that would support this: https://groups.google.com/g/kubernetes-sig-apps/c/BiFXV9FrAas
The issue was discussed upstream with the following outcome:
- A feature tracking issue was created in order to obtain more feedback from the community before proceeding further: https://github.com/kubernetes/kubernetes/issues/107920
- A documentation update describing the current behaviour was created: https://github.com/kubernetes/website/pull/31603