This bug was initially created as a copy of Bug #2062459
I am copying this bug because David Eads has implemented a short-term workaround for it in the ingress operator. This new bug tracks the short-term workaround while the old bug tracks the longer-term scheduler fix.
Description of problem:
Two router pod replicas for the same generation of the same ingresscontroller can be scheduled to the same node despite pod anti-affinity rules that should prevent colocated pods.
OpenShift release version:
At least 4.11 and 4.10.
The scheduling issue affects router pods for at least AWS, Azure, and GCP. It most likely affects all cloud platforms. I don't know whether the scheduling issue affects router pods for on-premise platforms, which use the "HostNetwork" endpoint publishing strategy by default.
See bug 2062459.
Steps to Reproduce (in detail):
The issue has been observed to cause a significant number of CI job failures. The cause appears to be a race condition. As far as I know, we do not have a reliable reproducer.
Pods for the same generation of the same ingresscontroller are sometimes scheduled to the same node; see bug 2062459.
These pods should always be spread across nodes.
Impact of the problem:
Failure to spread router pods out across nodes increases the impact of rolling updates of nodes or outages of individual nodes or availability zones.
See bug 2062459 for example CI failures.
The router pod anti-affinity rule is defined here: https://github.com/openshift/cluster-ingress-operator/blob/5040f65551851b3ee284f0803bfdd1c64631c4c6/pkg/operator/controller/ingress/deployment.go#L337-L357
This anti-affinity rule is only added when using the "LoadBalancerService" endpoint publishing strategy. By default, cloud platforms (Alibaba, AWS, Azure, GCP, IBM Cloud, and Power VS) use "LoadBalancerService" while other platforms use "HostNetwork".
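To make the expected behavior concrete, here is a minimal sketch in Go of a check that reports nodes hosting more than one router pod for the same ingresscontroller and the same generation. It is not the operator's code: the `Pod` type is a simplified stand-in for the Kubernetes API type, and the label keys `example.io/ingresscontroller` and `example.io/deployment-hash` are hypothetical placeholders for whatever labels the real anti-affinity rule selects on.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified, hypothetical stand-in for a scheduled pod.
// It is not the Kubernetes API type.
type Pod struct {
	Name   string
	Node   string
	Labels map[string]string
}

// Hypothetical label keys, used for illustration only. The real rule in the
// ingress operator may select on different keys.
const (
	controllerLabel = "example.io/ingresscontroller"
	hashLabel       = "example.io/deployment-hash"
)

// colocatedNodes returns the sorted names of nodes that host more than one
// pod carrying the same controller label value and the same hash label value.
// Pods missing either label are ignored.
func colocatedNodes(pods []Pod) []string {
	type key struct{ node, controller, hash string }
	counts := map[key]int{}
	for _, p := range pods {
		c, okC := p.Labels[controllerLabel]
		h, okH := p.Labels[hashLabel]
		if !okC || !okH {
			continue
		}
		counts[key{p.Node, c, h}]++
	}
	seen := map[string]bool{}
	var nodes []string
	for k, n := range counts {
		if n > 1 && !seen[k.node] {
			seen[k.node] = true
			nodes = append(nodes, k.node)
		}
	}
	sort.Strings(nodes)
	return nodes
}

func main() {
	same := map[string]string{controllerLabel: "default", hashLabel: "abc"}
	pods := []Pod{
		{Name: "router-1", Node: "node-a", Labels: same},
		{Name: "router-2", Node: "node-a", Labels: same},
		{Name: "router-3", Node: "node-b", Labels: same},
	}
	fmt.Println("nodes with colocated router pods:", colocatedNodes(pods))
}
```

Pods from different generations (different hash values) on the same node are deliberately not reported, since that situation is expected during a rolling update.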
With "HostNetwork", router pods use the host network. Every router pod requires host ports 80, 443, and 1936, so the scheduler already prevents two router pods from being scheduled to the same node, even without pod anti-affinity. I am not aware of an issue with the scheduler as pertains to host port conflicts.
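The host port argument can be sketched as a toy placement loop in Go. This is only an illustration of the reasoning, not the scheduler's implementation: the `Node` type and the `fits`, `bind`, and `schedule` helpers are hypothetical, and the only detail taken from this report is the set of ports (80, 443, and 1936).

```go
package main

import "fmt"

// routerPorts are the host ports that every router pod requires,
// per the description in this report.
var routerPorts = []int{80, 443, 1936}

// Node is a hypothetical, simplified node that tracks claimed host ports.
type Node struct {
	Name      string
	UsedPorts map[int]bool
}

// fits reports whether none of the requested host ports is already in use.
func fits(n *Node, ports []int) bool {
	for _, p := range ports {
		if n.UsedPorts[p] {
			return false
		}
	}
	return true
}

// bind claims the requested host ports on the node.
func bind(n *Node, ports []int) {
	if n.UsedPorts == nil {
		n.UsedPorts = map[int]bool{}
	}
	for _, p := range ports {
		n.UsedPorts[p] = true
	}
}

// schedule places a pod on the first node whose host ports are free and
// returns that node's name. It returns false if no node can take the pod.
func schedule(nodes []*Node, ports []int) (string, bool) {
	for _, n := range nodes {
		if fits(n, ports) {
			bind(n, ports)
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []*Node{{Name: "node-a"}, {Name: "node-b"}}
	for i := 1; i <= 3; i++ {
		name, ok := schedule(nodes, routerPorts)
		if !ok {
			fmt.Printf("router pod %d: unschedulable (host ports taken on every node)\n", i)
			continue
		}
		fmt.Printf("router pod %d: scheduled to %s\n", i, name)
	}
}
```

With two nodes and three router pods, the third pod has nowhere to go, which is why colocation cannot occur with "HostNetwork" regardless of pod anti-affinity.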
There are no more failures noted in the recent runs for the "[sig-scheduling][Early] The HAProxy router pods should be scheduled on different nodes" test. Marking this as "verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.