Bug 1859230 - 4.6 vsphere CI: Worker nodes are not being created
Summary: 4.6 vsphere CI: Worker nodes are not being created
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: aos-install
QA Contact: jima
URL:
Whiteboard:
Duplicates: 1860783
Depends On:
Blocks: 1865743 1868448
 
Reported: 2020-07-21 13:54 UTC by Petr Muller
Modified: 2020-08-21 16:29 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1868448
Environment:
Last Closed: 2020-08-21 16:29:47 UTC
Target Upstream Version:




Links
Github openshift/machine-config-operator pull 1945 (status: closed) - Bug 1859230: operator/sync: update getIgnitionHost to account for nil PlatformStatus (last updated 2021-02-17 04:05:03 UTC)

Description Petr Muller 2020-07-21 13:54:47 UTC
Description of problem:

Both release-*-e2e-vsphere-*4.6 CI jobs consistently fail installation with the following:

~~~~~
level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=error msg="Cluster operator kube-apiserver Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending; 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-control-plane-2_openshift-kube-apiserver(40cb13413734fb03c21ce18ba60c9735)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-control-plane-2_openshift-kube-apiserver(40cb13413734fb03c21ce18ba60c9735)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver-check-endpoints\" is not ready: unknown reason\nStaticPodsDegraded: pod/kube-apiserver-control-plane-1 container \"kube-apiserver-check-endpoints\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-1_openshift-kube-apiserver(b034e27783ace7180a13393497ce7435)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-1 container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-1_openshift-kube-apiserver(b034e27783ace7180a13393497ce7435)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-0 container \"kube-apiserver-check-endpoints\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-0_openshift-kube-apiserver(c96a5a43a055653004ab702f0ebbae7f)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-0 container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-0_openshift-kube-apiserver(c96a5a43a055653004ab702f0ebbae7f)"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4"
level=info msg="Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4"
level=error msg="Cluster operator kube-controller-manager Degraded is True with NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending; 1 nodes are failing on revision 3:\nNodeInstallerDegraded: "
level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 4"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-07-18-035613"
level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 4.6.0-0.nightly-2020-07-18-035613: openshift-config-managed/kube-cloud-config configmap is required on platform VSphere but not found: configmap \"kube-cloud-config\" not found"
~~~~~

Recent release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6 failures:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284851991570288640#1:build-log.txt%3A368
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284489511203508224#1:build-log.txt%3A347
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284126896174403584#1:build-log.txt%3A380
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1283764458539192320#1:build-log.txt%3A377
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1283401794281541632#1:build-log.txt%3A365

Recent release-openshift-ocp-installer-e2e-vsphere-upi-4.6 failures:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512#1:build-log.txt%3A359
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284489411332935680#1:build-log.txt%3A352
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284126896153432064#1:build-log.txt%3A456
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1283764458514026496#1:build-log.txt%3A374


Version-Release number of selected component (if applicable):
4.6

How reproducible:
https://sippy-bparees.svc.ci.openshift.org/?release=4.6 now reports vsphere jobs as 100% failing, and all jobs that even get to this point (10 out of the last 14 jobs) fail with this issue.


Additional info:
- Here are possible duplicates:
- https://bugzilla.redhat.com/show_bug.cgi?id=1856425 was filed on azure (not vsphere) and closed as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1856316 which is supposed to be fixed.
- CLOSED WONTFIX https://bugzilla.redhat.com/show_bug.cgi?id=1811221 for vsphere in 4.4

Comment 1 Maru Newby 2020-07-21 15:17:09 UTC
Our team doesn't have the capability to be primary support for a platform that doesn't gate control plane components. Please reassign to a vsphere SME and have them work with us to resolve.

Comment 2 Abhinav Dahiya 2020-07-21 16:47:19 UTC
picking one failure https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/configmaps.json

> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/configmaps.json
```
        {
            "apiVersion": "v1",
            "data": {
                "cloud.conf": "[Global]\nsecret-name = \"vsphere-creds\"\nsecret-namespace = \"kube-system\"\ninsecure-flag = \"1\"\n\n[Workspace]\nserver = \"vcsa-ci.vmware.devcluster.openshift.com\"\ndatacenter = \"dc1\"\ndefault-datastore = \"vsanDatastore\"\nfolder = \"/dc1/vm/ci-op-hksvhgv6-8c73e\"\n\n[VirtualCenter \"vcsa-ci.vmware.devcluster.openshift.com\"]\ndatacenters = \"dc1\"\n"
            },
            "kind": "ConfigMap",
            "metadata": {
                "creationTimestamp": "2020-07-19T14:14:31Z",
                "managedFields": [
                    {
                        "apiVersion": "v1",
                        "fieldsType": "FieldsV1",
                        "fieldsV1": {
                            "f:data": {
                                ".": {},
                                "f:cloud.conf": {}
                            }
                        },
                        "manager": "cluster-config-operator",
                        "operation": "Update",
                        "time": "2020-07-19T14:14:31Z"
                    }
                ],
                "name": "kube-cloud-config",
                "namespace": "openshift-config-managed",
                "resourceVersion": "3490",
                "selfLink": "/api/v1/namespaces/openshift-config-managed/configmaps/kube-cloud-config",
                "uid": "001d9c4f-d2cf-4ef4-a22e-8d430a79b9b2"
            }
        },
```


The `kube-cloud-config` configmap is available and is created by the kube_cloud_config controller in cluster-config-operator, so this looks like a machine-config-operator problem with accessing the correct object.
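
For reference, here is a minimal client-go sketch (not taken from the bug; the kubeconfig path is an assumption) of the lookup the MCO error message implies, i.e. fetching openshift-config-managed/kube-cloud-config. Against this cluster it succeeds, matching the artifact above:

```
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: path to a kubeconfig for the affected cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// The object the MCO reports as missing ("configmap \"kube-cloud-config\" not found").
	cm, err := client.CoreV1().ConfigMaps("openshift-config-managed").Get(
		context.TODO(), "kube-cloud-config", metav1.GetOptions{})
	if err != nil {
		log.Fatalf("kube-cloud-config not retrievable: %v", err)
	}
	// On this cluster the configmap exists, so the vSphere cloud.conf shown above is printed.
	fmt.Println(cm.Data["cloud.conf"])
}
```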


If you look at the logs for MCO
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-64cfc94945-hl5c7_machine-config-operator_previous.log
```
I0719 15:00:17.939227       1 operator.go:270] Starting MachineConfigOperator
E0719 15:00:17.961963       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 223 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1814240, 0x2a34840)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1814240, 0x2a34840)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
github.com/openshift/machine-config-operator/pkg/operator.getIgnitionHost(0xc000976188, 0xc000997890, 0x1a41f59, 0x18, 0xc000307758)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:330 +0x314
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc000206d80, 0x0, 0xc038917aaf, 0x29a2ec27ede)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:295 +0x10c7
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc000206d80, 0xc0003dfca8, 0x6, 0x6, 0xc00009c180, 0x6)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:69 +0x177
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc000206d80, 0xc00098eb70, 0x30, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:362 +0x40a
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc000206d80, 0x203000)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:318 +0xd2
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc000206d80)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:307 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000bdc440)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000bdc440, 0x1cbf1a0, 0xc0007f1f20, 0xc0007da701, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bdc440, 0x3b9aca00, 0x0, 0x1b2ca01, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc000bdc440, 0x3b9aca00, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:276 +0x3dc
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x168ae94]

goroutine 223 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1814240, 0x2a34840)
	/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166
github.com/openshift/machine-config-operator/pkg/operator.getIgnitionHost(0xc000976188, 0xc000997890, 0x1a41f59, 0x18, 0xc000307758)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:330 +0x314
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc000206d80, 0x0, 0xc038917aaf, 0x29a2ec27ede)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:295 +0x10c7
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc000206d80, 0xc0003dfca8, 0x6, 0x6, 0xc00009c180, 0x6)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:69 +0x177
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc000206d80, 0xc00098eb70, 0x30, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:362 +0x40a
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc000206d80, 0x203000)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:318 +0xd2
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc000206d80)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:307 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000bdc440)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000bdc440, 0x1cbf1a0, 0xc0007f1f20, 0xc0007da701, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bdc440, 0x3b9aca00, 0x0, 0x1b2ca01, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc000bdc440, 0x3b9aca00, 0xc0000b63c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:276 +0x3dc
```
you see a panic

and then the next pod

> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-64cfc94945-hl5c7_machine-config-operator.log
```
I0719 15:05:29.828172       1 start.go:46] Version: 4.6.0-0.nightly-2020-07-19-093912 (Raw: v4.6.0-202007181013.p0-dirty, Hash: fe81baa202b6644dc5521bf229372a282088a445)
I0719 15:05:29.830060       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-config-operator/machine-config...

```

is stuck waiting to acquire the leader lease before it can start.
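
For context, the trace above points at a nil pointer dereference inside getIgnitionHost, and the linked PR (openshift/machine-config-operator pull 1945) describes the fix as accounting for a nil PlatformStatus. Below is a minimal sketch of that kind of guard; the types are simplified, hypothetical stand-ins for the real Infrastructure status types, not the actual MCO code:

```
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins (hypothetical) for the infrastructure status fields involved.
type VSpherePlatformStatus struct {
	APIServerInternalIP string
}

type PlatformStatus struct {
	Type    string
	VSphere *VSpherePlatformStatus
}

type InfrastructureStatus struct {
	PlatformStatus       *PlatformStatus // can be nil, e.g. on some UPI-installed clusters
	APIServerInternalURL string
}

// getIgnitionHost sketches the failing code path: dereferencing PlatformStatus
// without a nil check is the kind of bug that produces a SIGSEGV like the one
// at sync.go:330 in the trace above.
func getIgnitionHost(status *InfrastructureStatus) (string, error) {
	if status == nil {
		return "", errors.New("infrastructure status is not populated")
	}
	// Guard suggested by the PR title: tolerate a nil PlatformStatus instead of panicking.
	if status.PlatformStatus == nil {
		return status.APIServerInternalURL, nil
	}
	if status.PlatformStatus.Type == "VSphere" && status.PlatformStatus.VSphere != nil &&
		status.PlatformStatus.VSphere.APIServerInternalIP != "" {
		return status.PlatformStatus.VSphere.APIServerInternalIP, nil
	}
	return status.APIServerInternalURL, nil
}

func main() {
	// With the guard, a status whose PlatformStatus is nil no longer crashes the operator.
	host, err := getIgnitionHost(&InfrastructureStatus{APIServerInternalURL: "https://api-int.example.com:6443"})
	fmt.Println(host, err)
}
```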

Comment 3 Yu Qi Zhang 2020-07-23 16:12:03 UTC
I don't think the MCO is the source of this error. The main MCO pod shouldn't be doing anything other than waiting for workers. In other jobs, both the current and the previous MCO pods panic: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284126896153432064/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-788f669cd8-fdhdx_machine-config-operator.log

So unless we're doing something wrong here https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/operator.go#L276

I think something else is the source of the error. For many of these jobs I can see the following operators being degraded:

authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-config, monitoring

Maybe another component would have more insight? Not sure where to move this, though.

Comment 7 Antonio Murdaca 2020-07-27 08:26:24 UTC
*** Bug 1860783 has been marked as a duplicate of this bug. ***

Comment 8 Michael Nguyen 2020-07-29 14:16:28 UTC
This more recent run from 2020-07-28 (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1288115144924073984), which has the fix, is still encountering the same error:

error: some steps failed:
  * could not run steps: step e2e-vsphere-upi failed: template pod "e2e-vsphere-upi" failed: the pod ci-op-3npxvqqf/e2e-vsphere-upi failed after 51m11s (failed containers: setup): ContainerFailed one or more containers exited
Container setup exited with code 1, reason Error
---
ed is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-07-25-091217"
level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 4.6.0-0.nightly-2020-07-25-091217: openshift-config-managed/kube-cloud-config configmap is required on platform VSphere but not found: configmap \"kube-cloud-config\" not found"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-07-25-091217"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."

Comment 9 jima 2020-08-05 09:00:55 UTC
The issue still exists on UPI on vSphere with an https proxy; the nightly build is 4.6.0-0.nightly-2020-08-04-210224.
~~~~~~~~~~~~
level=error msg="Cluster operator authentication Degraded is True with APIServerDeployment_UnavailablePod::IngressStateEndpoints_MissingSubsets::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.126.218:443/healthz\": dial tcp 172.30.126.218:443: connect: connection timed out\nRouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nAPIServerDeploymentDegraded: 3 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver"
level=info msg="Cluster operator authentication Available is False with APIServerDeployment_NoPod: APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node."
level=error msg="Cluster operator kube-apiserver Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending\nStaticPodsDegraded: pods \"kube-apiserver-control-plane-0\" not found"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3"
level=error msg="Cluster operator kube-controller-manager Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: StaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 2m40s restarting failed container=cluster-policy-controller pod=kube-controller-manager-control-plane-0_openshift-kube-controller-manager(f8ae47a0c6efdb3244d1bc2e3725812e)\nStaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=cluster-policy-controller pod=kube-controller-manager-control-plane-0_openshift-kube-controller-manager(f8ae47a0c6efdb3244d1bc2e3725812e)\nStaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"kube-controller-manager\" is not ready: unknown reason\nNodeInstallerDegraded: 1 nodes are failing on revision 5:\nNodeInstallerDegraded: ; 1 nodes are failing on revision 6:\nNodeInstallerDegraded: static pod of revision 6 has been installed, but is not ready while new revision 7 is pending"
level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 7"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-sdn/sdn-metrics\" rollout is not making progress - last change 2020-08-05T01:27:36Z"
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes"
level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
level=info msg="Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady: APIServicesAvailable: PreconditionNotReady"
level=info msg="Cluster operator openshift-controller-manager Progressing is True with : "
level=info msg="Cluster operator openshift-controller-manager Available is False with : "
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : "
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.16.0"
~~~~~~~~~~~~~~~

Comment 10 Kirsten Garrison 2020-08-10 17:51:15 UTC
Looking through the CI run logs, I'm no longer seeing the panic in the MCO, but installs are still failing across several operators; moving to auth since they are failing. For example:

evel=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.147.209:443/healthz\": dial tcp 172.30.147.209:443: connect: connection timed out\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address\nOAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps \"oauth-openshift\" not found\nRouteDegraded: Route is not available at canonical host oauth-openshift.apps.ci-op-2fnvilpp-8c73e.origin-ci-int-aws.dev.rhcloud.com: route status ingress is empty\nOAuthServerRouteDegraded: Route is not available at canonical host oauth-openshift.apps.ci-op-2fnvilpp-8c73e.origin-ci-int-aws.dev.rhcloud.com: route status ingress is empty"

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring, network"

Comment 11 Maru Newby 2020-08-10 20:37:45 UTC
According to the artifacts for a recent upi ci job [1]:

- The auth operator is degraded because its route has not been processed by ingress
- The ingress operator is unable to fully deploy because there no worker nodes to schedule to

Assigning to the installer team to figure out why the worker nodes are not being created properly as part of the install.

1: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1292825401403379712

Comment 12 W. Trevor King 2020-08-10 20:47:44 UTC
A possible direction for debugging why the compute nodes never joined would be to gather their console logs in the old-style template. Example AWS implementation: [1].

[1]: https://github.com/openshift/release/blob/bde0ab9f545d5197338a248ac24c36a991af48fb/ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml#L2381-L2394

Comment 13 Jeremiah Stuever 2020-08-10 21:28:53 UTC
This might be due to using an oc v4.5 binary on a 4.6 cluster, causing certificate approvals to fail. The prow job linked in Comment 11 is missing a few lines in the output, and we are seeing this elsewhere. In the last successful run on this job we saw the following lines:

certificatesigningrequest.certificates.k8s.io/csr-6ptpf approved
certificatesigningrequest.certificates.k8s.io/csr-fbtgm approved
certificatesigningrequest.certificates.k8s.io/csr-kl86x approved
certificatesigningrequest.certificates.k8s.io/csr-kzc7v approved
certificatesigningrequest.certificates.k8s.io/csr-hfmjt approved
certificatesigningrequest.certificates.k8s.io/csr-psn6g approved
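
As an illustration of what those lines represent (the CI template approves them with the oc CLI, not with code like this): workers cannot join until their node CSRs are approved, and a client-go sketch like the following, with an assumed kubeconfig path, would list the CSRs that are still pending:

```
package main

import (
	"context"
	"fmt"
	"log"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: path to a kubeconfig for the affected cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, csr := range csrs.Items {
		approved := false
		for _, cond := range csr.Status.Conditions {
			if cond.Type == certificatesv1.CertificateApproved {
				approved = true
			}
		}
		// Pending node-bootstrapper/node CSRs are what keep workers from joining.
		if !approved {
			fmt.Printf("pending CSR %s (requestor %s)\n", csr.Name, csr.Spec.Username)
		}
	}
}
```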

Comment 15 jima 2020-08-11 04:36:00 UTC
After using the latest oc version on 4.6, the issue is not reproduced any more when installing OCP 4.6 on UPI on vSphere.

Comment 16 RamaKasturi 2020-08-12 07:13:25 UTC
Tried to perform a fresh install on profile ipi-on-vsphere/versioned-installer-6_7-http_proxy-vsphere_slave-ci using payload 4.6.0-0.nightly-2020-08-11-040013 but hit the issue below.

level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.62.70:443/healthz\": dial tcp 172.30.62.70:443: connect: connection timed out\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address\nRouteDegraded: Route is not available at canonical host oauth-openshift.apps.knarra-08debug1.qe.devcluster.openshift.com: route status ingress is empty\nOAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps \"oauth-openshift\" not found\nOAuthServerRouteDegraded: Route is not available at canonical host oauth-openshift.apps.knarra-08debug1.qe.devcluster.openshift.com: route status ingress is empty"
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
level=info msg="Cluster operator monitoring Available is False with : "
level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-sdn/sdn-metrics\" rollout is not making progress - last change 2020-08-11T16:29:30Z"
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes"
level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring, network"

On further checking, Jinyun Ma said the CSR is in a pending state; for IPI, approval should happen automatically through the installer, which has not happened here. We have found two similar installs hitting the same issue.

The link below contains the must-gather logs:
===========================================
 http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/

Comment 17 Abhinav Dahiya 2020-08-12 17:57:55 UTC
The CSRs need to be approved by cluster-machine-approver

Looking at the logs in http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-cluster-machine-approver/pods/machine-approver-5f6457695b-q8t77/machine-approver-controller/machine-approver-controller/logs/current.log,
there are a bunch of these:
```
2020-08-12T06:09:40.916502422Z I0812 06:09:40.914467       1 main.go:147] CSR csr-bn8j8 added
2020-08-12T06:09:40.945201911Z I0812 06:09:40.945119       1 main.go:182] CSR csr-bn8j8 not authorized: failed to find machine for node knarra-08debug1-8dlwg-worker-gp8wk
```

Yet I can see the machine object at http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/machine.openshift.io/machines/

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/machine.openshift.io/machines/knarra-08debug1-8dlwg-worker-gp8wk.yaml

So there is a problem which is causing the approver to reject the CSR.

So looking at the controller for machine-api

> http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/pods/machine-api-controllers-7f468664b5-zhx4b/machine-controller/machine-controller/logs/current.log
```
2020-08-11T16:42:40.557574474Z I0811 16:42:40.557518       1 machine_scope.go:102] knarra-08debug1-8dlwg-worker-gp8wk: patching machine
2020-08-11T16:42:40.583703973Z E0811 16:42:40.583648       1 machine_scope.go:114] Failed to patch machine "knarra-08debug1-8dlwg-worker-gp8wk": admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120)
2020-08-11T16:42:40.583777731Z E0811 16:42:40.583750       1 controller.go:279] knarra-08debug1-8dlwg-worker-gp8wk: error updating machine: admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120)
```

I think this is causing the controller to fail to update the status of the vSphere machines, which means the approver doesn't have the information it needs to approve the CSR.
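
To make that chain concrete, here is a hedged sketch (simplified stand-in types, not the actual machine-api webhook code) of the validation the log shows rejecting every patch, and why 50 GiB disks trip it:

```
package main

import "fmt"

// Minimum enforced by the vSphere machine validation, per the webhook denial above.
const minDiskGiB = 120

// Simplified, hypothetical stand-in for the vSphere providerSpec fields involved.
type VSphereProviderSpec struct {
	DiskGiB int32
}

// validateDisk restates the denial from the machine-controller log: any patch of a
// Machine whose providerSpec.diskGiB is below 120 is rejected, so the machine's
// status never gets updated and the approver cannot match the node's CSR to it.
func validateDisk(spec VSphereProviderSpec) error {
	if spec.DiskGiB < minDiskGiB {
		return fmt.Errorf("providerSpec.diskGiB: Invalid value: %d: diskGiB is below minimum value (%d)",
			spec.DiskGiB, minDiskGiB)
	}
	return nil
}

func main() {
	// The failing install provisioned workers with 50 GiB disks, matching the denial above.
	fmt.Println(validateDisk(VSphereProviderSpec{DiskGiB: 50}))
}
```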


Now, since this bug mostly covers vSphere UPI failures, I would recommend creating another bug for this IPI failure and assigning it to the cloud team.

Comment 18 Abhinav Dahiya 2020-08-21 16:29:47 UTC
So bug https://bugzilla.redhat.com/show_bug.cgi?id=1868448 is the main reason why the IPI workers are not joining, and the UPI worker CSR approvals have already been fixed by https://github.com/openshift/release/pull/10950. Closing this bug.

Please feel free to create a new bug if a different reason comes up.

