Description of problem:

UPI fails when installing to a cloud provider.

Version-Release number of the following components:

$ bin/openshift-install version
bin/openshift-install unreleased-master-1780-gf96afb99f1ce4f8976ce62f7df44acb24d2062d6-dirty
built from commit f96afb99f1ce4f8976ce62f7df44acb24d2062d6
release image registry.svc.ci.openshift.org/origin/release:4.2

How reproducible:

Always

Steps to Reproduce:
1. Follow the UPI install docs (either the repo docs or the official docs). When creating the install-config, you must select a cloud provider; in my case, aws.
2. Watch the install time out.
3. Check clusteroperator status. You will see the auth and console operators fail.

Actual results:

$ oc get clusteroperator/authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   Unknown   Unknown     True                     61m

Expected results:

$ oc get clusteroperator/authentication
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.2.0-0.okd-2019-09-23-171417   True        False         False      54m

Additional info:

When a cluster is installed on a cloud platform, the ingress operator creates a Service of type LoadBalancer to front the ingress controller pods. These pods cannot run on master nodes, because Kubernetes excludes masters from target LB pools. The "empty compute pools" step [1] in the UPI docs causes masters to be provisioned with `master,worker` roles. When masters include the worker role, the scheduler will schedule ingress controller pods onto masters, which causes the ingress controller service to NOT load balance ingress traffic to the ingress controller pods. The authentication operator uses the oauth route as a health check as part of its bootstrap. The associated ingress has nowhere to forward the traffic, because the ingress controller pods are running on master nodes, which are not included in the target LB pool, effectively making ingress useless. The console operator does not start properly because it depends on a functional auth operator.
[1] https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#empty-compute-pools
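For reference, the "empty compute pools" step linked above amounts to setting the compute pool's replica count to 0. A minimal sketch of the relevant install-config.yaml stanza (field values are illustrative, not a verbatim copy of the docs):

```yaml
# Sketch of the install-config.yaml compute pool described in the UPI docs.
# Setting replicas to 0 is what causes the masters to come up with both the
# master and worker roles, making them schedulable for ingress router pods.
compute:
- name: worker
  platform: {}
  replicas: 0
```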
Do we have clarity on why this doesn't happen in CI? For example, [1] is a successful UPI run on 4.2. The control-plane nodes are all labeled with both master and worker roles [2,3,4], e.g.:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/core/nodes/ip-10-0-87-9.ec2.internal.yaml | grep node-role
    node-role.kubernetes.io/master: ""
    node-role.kubernetes.io/worker: ""

And the router pods were scheduled on ...[5,6], e.g.:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/namespaces/openshift-ingress/pods/router-default-6ff887557-2xzw6/router-default-6ff887557-2xzw6.yaml | grep nodeName
  nodeName: ip-10-0-70-36.ec2.internal

Are we just getting lucky? I'd have expected them to get scheduled on the control-plane machines, with control-plane machines coming up faster than compute machines. Apparently, by the time we try to schedule the router, all the nodes are up:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-48-105.ec2.internal/scheduler/scheduler/logs/current.log | grep router | head -n1
2019-09-24T15:59:26.6240642Z I0924 15:59:26.623771       1 scheduler.go:572] pod e2e-test-router-scoped-kskr7/endpoint-1 is bound successfully on node ip-10-0-51-107.ec2.internal, 6 nodes evaluated, 6 nodes were found feasible

and we are just getting lucky. Anyhow, dropping the 'replicas: 0' change or [7] for load-balancer-backed router platforms makes sense to me until we can address bug 1671136 / [8].
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183
[2]: https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/core/nodes/ip-10-0-87-9.ec2.internal.yaml
[3]: https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/core/nodes/ip-10-0-78-41.ec2.internal.yaml
[4]: https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/core/nodes/ip-10-0-48-105.ec2.internal.yaml
[5]: https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/namespaces/openshift-ingress/pods/router-default-6ff887557-2xzw6/router-default-6ff887557-2xzw6.yaml
[6]: https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/183/artifacts/e2e-aws-upi/must-gather/namespaces/openshift-ingress/pods/router-default-6ff887557-v5ss4/router-default-6ff887557-v5ss4.yaml
[7]: https://github.com/openshift/installer/pull/2004
[8]: https://github.com/kubernetes/kubernetes/issues/65618
This is going to be fixed in 4.3. We do not support ingress controllers on masters at this time because of a limitation in Kubernetes that we have provisionally fixed in 1.16; once we have tested that fix, we may consider backporting it to 4.2. Until then, service load balancers do not work against master nodes.
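Until that fix lands, a workaround consistent with the discussion above (dropping the 'replicas: 0' change) is to declare a non-empty compute pool so masters are not given the worker role and router pods land on dedicated workers. A sketch, with an illustrative replica count:

```yaml
# Workaround sketch: declare a non-empty compute pool so masters are not
# provisioned with the worker role. In UPI the compute machines themselves
# must still be provisioned out of band; the replica count is illustrative.
compute:
- name: worker
  platform: {}
  replicas: 3
```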
Related: bug 1744370, about how the ingress operator places the router pods.
*** Bug 1753761 has been marked as a duplicate of this bug. ***
also background discussion in bug 1671136.
The 2402 alternative was rejected; the 2440 approach landed, and bug 1738456 is tracking exposing that in the official docs. We don't need to independently verify the installer-side docs, so closing as a dup of the docs bug. *** This bug has been marked as a duplicate of bug 1738456 ***