Description of problem:

With OpenShift Container Platform 4.8.2, https://bugzilla.redhat.com/show_bug.cgi?id=1948711 was introduced, which set the number of replicas for prometheus-adapter to 2. The problem is that `maxSurge` is set to 25% and `maxUnavailable` to 1. With that, additional capacity is provisioned during the rollout, but due to the anti-affinity rules, FailedScheduling can occur depending on the number of OpenShift Container Platform - Node(s) available.

For example, when creating 3 OpenShift Container Platform - Infra Node(s) (following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html) and enforcing the Cluster Monitoring stack to run on those 3 OpenShift Container Platform - Infra Node(s) (following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html#infrastructure-moving-monitoring_creating-infrastructure-machinesets), we'll see FailedScheduling events during the rollout.

> ip-10-0-144-148.ha-fooo-X.compute.internal   Ready   worker         57m   v1.22.0-rc.0+a44d0f0   192.168.144.148   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-145-68.ha-fooo-X.compute.internal    Ready   worker         58m   v1.22.0-rc.0+a44d0f0   192.168.145.68    <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-149-190.ha-fooo-X.compute.internal   Ready   infra,worker   29m   v1.22.0-rc.0+a44d0f0   192.168.149.190   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-158-199.ha-fooo-X.compute.internal   Ready   master         64m   v1.22.0-rc.0+a44d0f0   192.168.158.199   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-185-177.ha-fooo-X.compute.internal   Ready   master         65m   v1.22.0-rc.0+a44d0f0   192.168.185.177   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-200-94.ha-fooo-X.compute.internal    Ready   infra,worker   29m   v1.22.0-rc.0+a44d0f0   192.168.200.94    <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-245-246.ha-fooo-X.compute.internal   Ready   master         65m   v1.22.0-rc.0+a44d0f0   192.168.245.246   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-247-191.ha-fooo-X.compute.internal   Ready   infra,worker   20m   v1.22.0-rc.0+a44d0f0   192.168.247.191   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8
> ip-10-0-250-250.ha-fooo-X.compute.internal   Ready   worker         58m   v1.22.0-rc.0+a44d0f0   192.168.250.250   <none>   Red Hat Enterprise Linux CoreOS 49.84.202110220538-0 (Ootpa)   4.18.0-305.19.1.el8_4.x86_64   cri-o://1.22.0-74.rhaos4.9.gitd745cab.el8

> Wed Nov 17 11:40:10 AM CET 2021
> prometheus-adapter-586b986d49-4csql   0/1   Pending             0   1s    <none>       <none>                                       <none>   <none>
> prometheus-adapter-586b986d49-q48z7   0/1   ContainerCreating   0   1s    <none>       ip-10-0-247-191.ha-fooo-X.compute.internal   <none>   <none>
> prometheus-adapter-7dcdbb4d87-4hlpp   1/1   Terminating         0   10m   10.130.2.6   ip-10-0-149-190.ha-fooo-X.compute.internal   <none>   <none>
> prometheus-adapter-7dcdbb4d87-fb9lj   1/1   Running             0   10m   10.131.2.7   ip-10-0-200-94.ha-fooo-X.compute.internal    <none>   <none>

> LAST SEEN   TYPE      REASON              OBJECT                                     MESSAGE
> 1s          Warning   FailedScheduling    pod/prometheus-adapter-586b986d49-4csql    0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
> 1s          Normal    Scheduled           pod/prometheus-adapter-586b986d49-q48z7    Successfully assigned openshift-monitoring/prometheus-adapter-586b986d49-q48z7 to ip-10-0-247-191.ha-fooo-X.compute.internal
> 1s          Normal    AddedInterface      pod/prometheus-adapter-586b986d49-q48z7    Add eth0 [10.128.4.8/23] from openshift-sdn
> 0s          Normal    Pulling             pod/prometheus-adapter-586b986d49-q48z7    Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0be6fb325dd16975b2ace01362ebcaf0cc97075f1641c377f4580dea72941d64"
> 2s          Normal    SuccessfulCreate    replicaset/prometheus-adapter-586b986d49   Created pod: prometheus-adapter-586b986d49-q48z7
> 2s          Normal    SuccessfulCreate    replicaset/prometheus-adapter-586b986d49   Created pod: prometheus-adapter-586b986d49-4csql
> 2s          Normal    Killing             pod/prometheus-adapter-7dcdbb4d87-4hlpp    Stopping container prometheus-adapter
> 2s          Normal    SuccessfulDelete    replicaset/prometheus-adapter-7dcdbb4d87   Deleted pod: prometheus-adapter-7dcdbb4d87-4hlpp
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled up replica set prometheus-adapter-586b986d49 to 1
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled down replica set prometheus-adapter-7dcdbb4d87 to 1
> 2s          Normal    ScalingReplicaSet   deployment/prometheus-adapter              Scaled up replica set prometheus-adapter-586b986d49 to 2

Above we can see that "0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate" is reported. It's only transient and will reconcile quickly, but it's still something that can trigger unneeded activity/events or even alerts.

I therefore recommend setting `maxSurge` to 0 while keeping `maxUnavailable` at 1. This prevents the transient event from being reported because no additional capacity is provisioned during the rollout, so even environments with only a small amount of spare capacity should be fine and not see any problematic event.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.9.5

How reproducible:
- Always

Steps to Reproduce:
1. Set up 3 OpenShift Container Platform 4 - Infra Node(s) following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html
2. Assign Cluster Monitoring to these 3 OpenShift Container Platform 4 - Infra Node(s) following https://docs.openshift.com/container-platform/4.9/machine_management/creating-infrastructure-machinesets.html#infrastructure-moving-monitoring_creating-infrastructure-machinesets
3. Trigger a rollout of the `prometheus-adapter` deployment (an example command is shown under Additional info below)

Actual results:
"0/9 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate" is reported as an event because one pod stays in the Pending state for a short period of time: no suitable OpenShift Container Platform 4 - Infra Node can be found due to the constraints (anti-affinity) put in place by the deployment.

Expected results:
During a regular rollout of the `prometheus-adapter` deployment there should not be any FailedScheduling event reported; the deployment should be configured in a way that works in all possible scenarios.

Additional info:
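One possible way to trigger the rollout in step 3 manually (the cluster-monitoring-operator will also roll the deployment on its own, e.g. during upgrades):

    $ oc -n openshift-monitoring rollout restart deployment/prometheus-adapter

For illustration only, a minimal sketch of how the suggested change would look in the deployment's rollout strategy; apart from the strategy itself, the field values are illustrative and not copied from the cluster:

    # prometheus-adapter Deployment excerpt (illustrative sketch)
    spec:
      replicas: 2
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 0        # suggested: do not create a surge pod during the rollout
          maxUnavailable: 1  # keep one replica serving while the other is replaced

With `maxSurge: 0` no third pod is created during a normal rollout, so the anti-affinity rule cannot run out of eligible nodes on a 3-node infra pool.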
Thanks for the detailed report! We followed the OCP guidelines about high availability [1], so we'll need to investigate further with the workload team whether we did something wrong or whether the guidelines should be amended. Decreasing the severity to low since the failed scheduling error is transient and resolves by itself.

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability
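For context: the FailedScheduling event above ("didn't match pod anti-affinity rules") indicates a hard (required) pod anti-affinity constraint. A hypothetical sketch of what such a constraint looks like in a pod template; the label values are illustrative and not copied from the actual deployment:

    # Illustrative pod template excerpt (not copied from the cluster): with a
    # required anti-affinity rule on the hostname topology key, a surged third
    # pod has no eligible node left once every infra node already runs a
    # prometheus-adapter pod (including one that is still terminating).
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus-adapter   # illustrative label
          topologyKey: kubernetes.io/hostname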
Following up on our previous conversation. As the events are a minor inconvenience, I would leave `maxSurge` at the default of 25%. It is true that 0 would lower the number of events, but it would not prevent them entirely in case replicas == number of nodes. Also, I am planning to propose a feature upstream that would support this scenario.
Closing this as deferred based on Filip's comment about the upstream proposal. From the monitoring perspective, we have followed the best practices as per https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability. The error is transient and the issue resolves on its own.
(In reply to Filip Krepinsky from comment #3)
> Also, I am planning to propose a feature upstream that would support this scenario.

Do you have details on where this effort is tracked, so we can stay updated on the upstream progress?
There is no upstream tracking yet; I am planning to follow up on this in January.
I have started an upstream thread about a feature that would support this: https://groups.google.com/g/kubernetes-sig-apps/c/BiFXV9FrAas
The issue was discussed upstream with the following outcome:
- A feature tracking issue was created in order to obtain more feedback from the community before proceeding further: https://github.com/kubernetes/kubernetes/issues/107920
- A documentation update describing the current behaviour was created: https://github.com/kubernetes/website/pull/31603