Description of problem:

This seems to be a very common installation flake today, showing up across PR and release jobs and cloud platforms (71 occurrences in the last 24 hours):

https://search.svc.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+ingress+is+still+updating&maxAge=12h&context=2&type=all

level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.\nMoving to release version \"4.4.0-0.ci-2020-01-29-015253\".\nMoving to ingress-controller image version \"registry.svc.ci.openshift.org/ocp/4.4-2020-01-29-015253@sha256:c5aa779b80bf6b7f9e98a4f85a3fec5a17543ce89376fc13e924deedcd7298cf\"."
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Cluster operator ingress is still updating"

Example jobs:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.4/630
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/866
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/1150

(I'm not sure whether I filed this against the right component; I know ingress is networking-related, but that's all.)
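For anyone triaging a live repro (this is just a sketch of the usual checks, not pulled from these jobs), the operator's conditions and the router pods can be inspected with:

$ oc get clusteroperator ingress                # Available/Progressing/Degraded at a glance
$ oc get clusteroperator ingress -o yaml        # full condition messages from the ingress operator
$ oc get pods -n openshift-ingress -o wide      # whether/where the router pods actually scheduled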
[1] has the expected three control-plane machines, but only one compute machine, so the router's second replica cannot be scheduled. From [2]:

  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1

I'd expect [3] to show an unscheduled second pod, but it only has the one ready pod. There are two Provisioned (but not Running) Machines [4,5]. Not sure why those machines' kubelets could not join the cluster.

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/cluster-scoped-resources/core/nodes/
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-ingress/apps/deployments.yaml
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-ingress/pods/
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8.yaml
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1c-f5nrr.yaml
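The same shape of check on a live cluster (again a sketch, not from the artifacts) would compare Machine phases against Nodes and the router Deployment's replica counts:

$ oc get nodes
$ oc get machines -n openshift-machine-api      # PHASE stays Provisioned until the kubelet joins as a Node
$ oc get deployment router-default -n openshift-ingress \
    -o jsonpath='{.status.readyReplicas}/{.status.replicas}{"\n"}'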
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/pods/machine-api-controllers-674cfb8f66-kn577/machine-controller/machine-controller/logs/current.log | grep ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8
2020-01-29T16:03:57.753337743Z I0129 16:03:57.753309 1 controller.go:161] Reconciling Machine "ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8"
...
2020-01-29T16:04:02.788061916Z I0129 16:04:02.788015 1 actuator.go:202] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: ProviderID set at machine spec: aws:///us-east-1b/i-078909486b3234e41
...
2020-01-29T16:04:02.804690428Z I0129 16:04:02.804654 1 actuator.go:651] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: Instance state still pending, returning an error to requeue
...
2020-01-29T16:04:23.179577848Z I0129 16:04:23.179552 1 actuator.go:437] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: found 1 running instances for machine
...
2020-01-29T16:04:23.200871769Z I0129 16:04:23.200831 1 controller.go:383] Machine "ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8" going into phase "Provisioned"
...

Looks fine. Will check MCO logs to see if we can confirm Ignition configs getting pulled or some such...
Machine-config server shows only a single request coming in [1]:

2020-01-29T16:04:36.979077584Z I0129 16:04:36.979026 1 api.go:97] Pool worker requested by 10.0.143.168:19227

But the IP for the active compute node is 10.0.129.217 [2]. Maybe we're just seeing the load balancer's IP? If so, are there HTTP headers we should be looking at and logging to help traverse that kind of proxying? I don't know whether our provider load-balancers support Forwarded and such [3].

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-config-operator/pods/machine-config-server-h2st5/machine-config-server/machine-config-server/logs/current.log
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-r9p28.yaml
[3]: https://tools.ietf.org/html/rfc7239
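For comparison, one way to generate a known request against that endpoint and see which source address the MCS logs for it (sketch only; the api-int hostname depends on the cluster's base domain, and newer releases may also want an Ignition Accept header):

$ curl -sk -o /dev/null -w '%{http_code}\n' \
    https://api-int.<cluster-domain>:22623/config/worker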
I think if the MCS just shows one hit and we are missing two machines, we may want the machine-api operator folks to look at this as well.
https://github.com/openshift/release/pull/6906 <- debugging help (pulling AWS console logs from any instances referenced from Machines, in addition to any referenced by nodes, at least for jobs which use that template).
Abhinav found a running compute machine which was showing:

Jan 29 19:55:20 ip-10-0-130-205 hyperkube[2623]: E0129 19:55:20.869251 2623 certificate_manager.go:421] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Unauthorized

So the current theory is that the troubled machines are getting their Ignition config from the bootstrap machine-config server, but that by the time they attempt to create a CSR, the kube-apiserver has rotated that bootstrap chain of trust out.
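A rough way to check that theory on an affected cluster (a sketch, not from the job artifacts): the stuck machines should have produced no CSR on the API side, and the kubelet journal on the instance should show the Unauthorized bootstrap attempts:

$ oc get csr                                            # expect nothing pending/approved for the stuck machines
# on the stuck instance itself (SSH or serial console):
$ sudo journalctl -u kubelet | grep -i 'certificate signing request'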
The plan is to stop serving bootstrap Ignition configs to the compute machines, so they have to wait for the production MCS and production certs, which will continue to work after bootstrap trust is gone.
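For context, the compute Machines pick up that MCS endpoint from the stub Ignition config in the worker user-data secret, which can be inspected with something like the following (sketch; the exact jq path depends on the Ignition spec version in use):

$ oc get secret worker-user-data -n openshift-machine-api \
    -o jsonpath='{.data.userData}' | base64 -d | jq '.ignition.config'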
The timing on this issue seems to be related to [1], although there's been a fair bit of churn in installer-space over the past week with the etcd-operator getting turned on (see [2,3]). The underlying "we serve bootstrap creds to compute machines" behavior is older, but the etcd-operator shuffling is turning up races in this space as it settles in.

[1]: https://github.com/openshift/installer/pull/3007
[2]: https://github.com/openshift/installer/pull/2730
[3]: https://github.com/openshift/cluster-etcd-operator/pull/53
Bumping the priority. This is impacting CI.
Verified on 4.4.0-0.nightly-2020-02-27-070700. On a bare-metal install with PXE, started a compute node and verified that it was not served an Ignition config. Meanwhile, started the install of the master nodes / control plane and verified that the compute node only received an Ignition config after the control plane was ready. CI search is also showing no occurrences of the error in the past 14 days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581