Bug 1744370

Summary: Ingress erroneously reports available on cloud platforms when all nodes are masters
Product: OpenShift Container Platform
Component: Networking (sub component: router)
Reporter: Dan Mace <dmace>
Assignee: Andrew McDermott <amcdermo>
QA Contact: Hongan Li <hongli>
Status: CLOSED WONTFIX
Severity: medium
Priority: low
CC: amcdermo, aos-bugs, ccoleman, wking
Version: 4.2.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-09-24 20:17:09 UTC

Description Dan Mace 2019-08-22 01:13:49 UTC
Description of problem:

On a cloud platform like AWS and given node topology like:

NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-139-134.us-east-2.compute.internal   Ready    master,worker   21h   v1.14.0+17b784327
ip-10-0-157-164.us-east-2.compute.internal   Ready    master,worker   21h   v1.14.0+17b784327
ip-10-0-167-50.us-east-2.compute.internal    Ready    master,worker   21h   v1.14.0+17b784327

The default ingress controller will be scheduled to masters even though the master nodes will never be registered as ELB instances. This means ingress is effectively broken, but the ingress operator reports available and not-degraded.

The fact that ingress controllers are scheduled to masters at all when using LoadBalancer publishing strategy seems like a bug, as Kubernetes won't support the cloud balancer wiring. If the ingress controllers couldn't schedule on masters, the ingress operator would correctly report unavailable.
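For reference, the default ingress controller in this scenario is published via a cloud load balancer; a minimal sketch of such an IngressController resource follows (field values assumed from a default AWS install, not copied from the affected cluster):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  # LoadBalancerService asks Kubernetes to wire the router pods
  # to a cloud load balancer (an ELB on AWS). Master nodes are
  # excluded from ELB registration, so router pods scheduled on
  # masters receive no traffic.
  endpointPublishingStrategy:
    type: LoadBalancerService
```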

Version-Release number of selected component (if applicable):

4.2.0-0.nightly-2019-08-10-002649

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Hongan Li 2019-08-22 09:57:00 UTC
Since the problem always affects certain installations and misleads other components (e.g. auth and console), I expect we can fix it in 4.2.

Comment 3 Dan Mace 2019-08-22 14:00:08 UTC
Ingress is not supported for the topology under test in 4.2, so I do not agree with a decision to block the release.

I do agree we should add more signals to inform the user that the cluster is in a degraded state, and, stepping back, documentation should have set users' expectations in the first place.

Comment 4 Dan Mace 2019-09-24 11:38:19 UTC
One option would be to change the node placement of the default ingresscontroller to explicitly exclude masters when using the LoadBalancer publishing strategy. That would codify the upstream rules.
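A rough sketch of what such a node placement change could look like on the default IngressController (the `nodePlacement` stanza and the worker-role label are assumptions based on the operator API, not a confirmed fix):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
  # Hypothetical placement restricting router pods to worker
  # nodes, so they land only on nodes the cloud load balancer
  # can actually register.
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
```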

Comment 5 Clayton Coleman 2019-09-24 20:02:55 UTC
Note that in 4.3 we will allow this, so don't do anything that complicates 4.3

Comment 6 Dan Mace 2019-09-24 20:17:09 UTC
https://github.com/kubernetes/kubernetes/pull/80238 and https://github.com/openshift/cluster-kube-apiserver-operator/pull/572 will finally enable this use case, so going forward the current behavior won't actually be a bug.