Bug 2071139 - Ingress pods scheduled on the same node
Summary: Ingress pods scheduled on the same node
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.11.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Arvind iyengar
Depends On:
Reported: 2022-04-01 23:43 UTC by Miciah Dashiel Butler Masters
Modified: 2022-08-10 11:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An issue with the scheduler can cause it to ignore pod anti-affinity rules when scheduling pods.
Consequence: Two router pod replicas for the same generation of the same IngressController could be scheduled to the same node, increasing the risk of disruption to ingress during cluster upgrades or node outages.
Fix: Logic was added to the ingress operator to evict misscheduled router pods.
Result: Router pods are properly spread across multiple nodes to reduce disruption during upgrades and increase resilience to node outages.
Clone Of:
Last Closed: 2022-08-10 11:03:06 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-ingress-operator pull 720 | 0 | None | Merged | Bug 2071139: delete default ingress pod if it is scheduled where another router pod already is | 2022-04-23 03:03:05 UTC
Github | openshift cluster-ingress-operator pull 734 | 0 | None | Merged | Bug 2071139: add pod eviction permission | 2022-04-23 03:03:04 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:04:03 UTC

Description Miciah Dashiel Butler Masters 2022-04-01 23:43:03 UTC
This bug was initially created as a copy of Bug #2062459

I am copying this bug because David Eads has implemented a short-term workaround for it in the ingress operator.  This new bug tracks the short-term workaround while the old bug tracks the longer-term scheduler fix.  

Description of problem:

Two router pod replicas for the same generation of the same ingresscontroller can be scheduled to the same node despite pod anti-affinity rules that should prevent colocated pods.  

OpenShift release version:

At least 4.11 and 4.10.  

Cluster Platform:

The scheduling issue affects router pods for at least AWS, Azure, and GCP.  It most likely affects all cloud platforms.  I don't know whether the scheduling issue affects router pods for on-premise platforms, which use the "HostNetwork" endpoint publishing strategy by default.  

How reproducible:

See bug 2062459.

Steps to Reproduce (in detail):

The issue has been observed to be causing a significant number of CI job failures.  The cause appears to be a race condition.  As far as I know, we do not have a reliable reproducer.

Actual results:

Pods for the same generation of the same ingresscontroller are sometimes scheduled to the same node; see bug 2062459.

Expected results:

These pods should always be spread across nodes.
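The expectation above can be checked mechanically. The following is a minimal sketch, not the operator's actual logic: it assumes each router pod is described by its name, node, ingresscontroller, and a generation identifier (the `RouterPod` type, `find_colocated` function, and all sample values are hypothetical), and it reports any node that hosts more than one pod for the same generation of the same ingresscontroller.

```python
from collections import defaultdict
from typing import NamedTuple


class RouterPod(NamedTuple):
    """Hypothetical, simplified view of a router pod."""
    name: str
    node: str
    ingresscontroller: str
    generation: str  # hypothetical identifier for the deployment generation


def find_colocated(pods):
    """Group pods by (node, ingresscontroller, generation) and return
    only the groups that contain more than one pod."""
    groups = defaultdict(list)
    for pod in pods:
        key = (pod.node, pod.ingresscontroller, pod.generation)
        groups[key].append(pod.name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Illustrative data only: two pods of the same generation share node-1.
pods = [
    RouterPod("router-default-a", "node-1", "default", "gen-1"),
    RouterPod("router-default-b", "node-1", "default", "gen-1"),
    RouterPod("router-default-c", "node-2", "default", "gen-2"),
]
colocated = find_colocated(pods)
```

An empty result means the pods are spread as expected; a non-empty result identifies the misscheduled group.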

Impact of the problem:

Failure to spread router pods out across nodes increases the impact of rolling updates of nodes or outages of individual nodes or availability zones.  

Additional info:

See bug 2062459 for example CI failures.  

The router pod anti-affinity rule is defined here: https://github.com/openshift/cluster-ingress-operator/blob/5040f65551851b3ee284f0803bfdd1c64631c4c6/pkg/operator/controller/ingress/deployment.go#L337-L357

This anti-affinity rule is only added when using the "LoadBalancerService" endpoint publishing strategy.  By default, cloud platforms (Alibaba, AWS, Azure, GCP, IBM Cloud, and Power VS) use "LoadBalancerService" while other platforms use "HostNetwork".  
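For readers unfamiliar with pod anti-affinity, the general shape of such a rule can be sketched as follows. This is an illustration under stated assumptions, not a copy of the rule in the linked source: the label keys are hypothetical placeholders, and the choice of a required (rather than preferred) rule keyed on the node hostname is an assumption. Only the field names (`podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector`, `matchExpressions`, `topologyKey`) are standard Kubernetes API fields.

```python
# Hypothetical label keys, used for illustration only.
CONTROLLER_LABEL = "example.com/ingresscontroller"
HASH_LABEL = "example.com/generation-hash"


def router_anti_affinity(controller_name, generation_hash):
    """Build an affinity stanza that forbids two pods with the same
    controller and generation labels from sharing a node."""
    return {
        "podAntiAffinity": {
            # Assumption: a hard requirement, evaluated at scheduling time.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # One pod per node: the topology domain is the hostname.
                    "topologyKey": "kubernetes.io/hostname",
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": CONTROLLER_LABEL,
                                "operator": "In",
                                "values": [controller_name],
                            },
                            {
                                "key": HASH_LABEL,
                                "operator": "In",
                                "values": [generation_hash],
                            },
                        ]
                    },
                }
            ]
        }
    }


affinity = router_anti_affinity("default", "abc123")
```

Matching on both the controller and the generation lets pods from different generations share a node during a rolling update while still separating replicas of the same generation.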

With "HostNetwork", router pods use the host network, which prevents them from being colocated on the same node: every router pod requires ports 80, 443, and 1936, so when using the host network, the scheduler already prevents two router pods from being scheduled to the same node, even without the use of pod anti-affinity.  I am not aware of an issue with the scheduler as it pertains to host port conflicts.
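The host-port reasoning above can be illustrated with a toy model. This is a simplified sketch and not the Kubernetes scheduler's implementation: the `fits` and `schedule` functions and the node names are hypothetical, and the only detail taken from this report is the set of ports (80, 443, and 1936) that every router pod requires.

```python
# Ports every router pod requires, per the description above.
ROUTER_HOST_PORTS = frozenset({80, 443, 1936})


def fits(used_ports, pod_ports=ROUTER_HOST_PORTS):
    """A pod fits on a node only if none of its host ports are taken."""
    return used_ports.isdisjoint(pod_ports)


def schedule(nodes, pod_ports=ROUTER_HOST_PORTS):
    """Place a pod on the first node (in name order) with no port
    conflict, record its ports as used, and return the node name.
    Return None if every node has a conflict."""
    for name in sorted(nodes):
        if fits(nodes[name], pod_ports):
            nodes[name] |= pod_ports
            return name
    return None


# Two empty nodes, three router pods using the host network.
nodes = {"node-1": set(), "node-2": set()}
placements = [schedule(nodes) for _ in range(3)]
```

In this model the first two pods land on different nodes and the third cannot be placed at all, which mirrors why host networking keeps router pods apart without any anti-affinity rule.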

Comment 5 Arvind iyengar 2022-05-09 02:53:07 UTC
There are no more failures noted in the recent runs for the "[sig-scheduling][Early] The HAProxy router pods should be scheduled on different nodes" test. Marking this as "verified".

Comment 8 errata-xmlrpc 2022-08-10 11:03:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

