Description of problem:

This seems to be a very common installation flake today, showing up across PR and release jobs and cloud platforms (71 occurrences in the last 24 hours):

https://search.svc.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+ingress+is+still+updating&maxAge=12h&context=2&type=all

level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.\nMoving to release version \"4.4.0-0.ci-2020-01-29-015253\".\nMoving to ingress-controller image version \"registry.svc.ci.openshift.org/ocp/4.4-2020-01-29-015253@sha256:c5aa779b80bf6b7f9e98a4f85a3fec5a17543ce89376fc13e924deedcd7298cf\"."
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Cluster operator ingress is still updating"

Example jobs:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.4/630
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/866
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/1150

(I'm not sure whether I filed this against the right component; I know ingress is networking-related, but that's all.)
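For anyone triaging a live repro (this is just a sketch of the usual checks, not pulled from these jobs), the operator's conditions and the router pods can be inspected with:

$ oc get clusteroperator ingress                # Available/Progressing/Degraded at a glance
$ oc get clusteroperator ingress -o yaml        # full condition messages from the ingress operator
$ oc get pods -n openshift-ingress -o wide      # whether/where the router pods actually scheduled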
[1] has the expected three control-plane machines, but only one compute machine, so the router's second replica cannot be scheduled. From [2]:

  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1

I'd expect [3] to show an unscheduled second pod, but it only has the one ready pod. There are two Provisioned (but not Running) Machines [4,5]. Not sure why those machines' kubelets could not join the cluster.

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/cluster-scoped-resources/core/nodes/
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-ingress/apps/deployments.yaml
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-ingress/pods/
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8.yaml
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1c-f5nrr.yaml
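The same shape of check on a live cluster (again a sketch, not from the artifacts) would compare Machine phases against Nodes and the router Deployment's replica counts:

$ oc get nodes
$ oc get machines -n openshift-machine-api      # PHASE stays Provisioned until the kubelet joins as a Node
$ oc get deployment router-default -n openshift-ingress \
    -o jsonpath='{.status.readyReplicas}/{.status.replicas}{"\n"}'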
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/pods/machine-api-controllers-674cfb8f66-kn577/machine-controller/machine-controller/logs/current.log | grep ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8
2020-01-29T16:03:57.753337743Z I0129 16:03:57.753309 1 controller.go:161] Reconciling Machine "ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8"
...
2020-01-29T16:04:02.788061916Z I0129 16:04:02.788015 1 actuator.go:202] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: ProviderID set at machine spec: aws:///us-east-1b/i-078909486b3234e41
...
2020-01-29T16:04:02.804690428Z I0129 16:04:02.804654 1 actuator.go:651] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: Instance state still pending, returning an error to requeue
...
2020-01-29T16:04:23.179577848Z I0129 16:04:23.179552 1 actuator.go:437] ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8: found 1 running instances for machine
...
2020-01-29T16:04:23.200871769Z I0129 16:04:23.200831 1 controller.go:383] Machine "ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-g9sm8" going into phase "Provisioned"
...

Looks fine. Will check MCO logs to see if we can confirm Ignition configs getting pulled or some such...
Machine-config server shows only a single request coming in [1]:

2020-01-29T16:04:36.979077584Z I0129 16:04:36.979026 1 api.go:97] Pool worker requested by 10.0.143.168:19227

But the IP for the active compute node is 10.0.129.217 [2]. Maybe we're just seeing the load balancer's IP? If so, are there HTTP headers we should be looking at and logging to help traverse that kind of proxying? I don't know whether our provider load-balancers support Forwarded and such [3].

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-config-operator/pods/machine-config-server-h2st5/machine-config-server/machine-config-server/logs/current.log
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/15825/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2020-01-29-015253-sha256-dd03a9e0ac7b6e710037d8f0d0a5001b47c27bf859e49b0256e8ac54f5bd8198/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-i8jwk853-77109-rh9g6-worker-us-east-1b-r9p28.yaml
[3]: https://tools.ietf.org/html/rfc7239
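For comparison, one way to generate a known request against that endpoint and see which source address the MCS logs for it (sketch only; the api-int hostname depends on the cluster's base domain, and newer releases may also want an Ignition Accept header):

$ curl -sk -o /dev/null -w '%{http_code}\n' \
    https://api-int.<cluster-domain>:22623/config/worker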
I think if the MCS just shows one hit and we are missing two machines, we may want the machine-api operator folks to look at this as well.
https://github.com/openshift/release/pull/6906 <- debugging help (pulling AWS console logs from any instances referenced from Machines, in addition to any referenced by nodes, at least for jobs which use that template).
Abhinav found a running compute machine which was showing:

Jan 29 19:55:20 ip-10-0-130-205 hyperkube[2623]: E0129 19:55:20.869251 2623 certificate_manager.go:421] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Unauthorized

So the current theory is that the troubled machines are getting their Ignition config from the bootstrap machine-config server, but that by the time they attempt to create a CSR, the kube-apiserver has rotated that bootstrap chain of trust out.
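A rough way to check that theory on an affected cluster (a sketch, not from the job artifacts): the stuck machines should have produced no CSR on the API side, and the kubelet journal on the instance should show the Unauthorized bootstrap attempts:

$ oc get csr                                            # expect nothing pending/approved for the stuck machines
# on the stuck instance itself (SSH or serial console):
$ sudo journalctl -u kubelet | grep -i 'certificate signing request'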
The plan is to stop serving bootstrap Ignition configs to the compute machines, so they have to wait for the production MCS and production certs, which will continue to work after bootstrap trust is gone.
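For context, the compute Machines pick up that MCS endpoint from the stub Ignition config in the worker user-data secret, which can be inspected with something like the following (sketch; the exact jq path depends on the Ignition spec version in use):

$ oc get secret worker-user-data -n openshift-machine-api \
    -o jsonpath='{.data.userData}' | base64 -d | jq '.ignition.config'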
The timing on this issue seems to be related to [1], although there's been a fair bit of churn in installer-space over the past week with the etcd-operator getting turned on (see [2,3]). The underlying "we serve bootstrap creds to compute machines" behavior is older, but the etcd-operator shuffling is turning up races in this space as it settles in.

[1]: https://github.com/openshift/installer/pull/3007
[2]: https://github.com/openshift/installer/pull/2730
[3]: https://github.com/openshift/cluster-etcd-operator/pull/53
Bumping the priority. This is impacting CI.
Verified on 4.4.0-0.nightly-2020-02-27-070700. On a bare-metal install with PXE, started a compute node and verified that it was not served an Ignition config. Meanwhile, started the install of the master nodes / control plane and verified that the compute node only received an Ignition config after the control plane was ready. CI search is also showing no occurrences of the error in the past 14 days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581