Description of problem: Both release-*-e2e-vspere-*4.6 CI jobs are consistently failing installation with the following: ~~~~~ level=error msg="Cluster operator authentication Degraded is True with ConfigObservation_Error::IngressStateEndpoints_MissingSubsets::RouterCerts_NoRouterCertSecret: RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nConfigObservationDegraded: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server" level=info msg="Cluster operator authentication Progressing is Unknown with NoData: " level=info msg="Cluster operator authentication Available is Unknown with NoData: " level=error msg="Cluster operator kube-apiserver Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending; 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-control-plane-2_openshift-kube-apiserver(40cb13413734fb03c21ce18ba60c9735)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-control-plane-2_openshift-kube-apiserver(40cb13413734fb03c21ce18ba60c9735)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-2 container \"kube-apiserver-check-endpoints\" is not ready: unknown reason\nStaticPodsDegraded: pod/kube-apiserver-control-plane-1 container \"kube-apiserver-check-endpoints\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-1_openshift-kube-apiserver(b034e27783ace7180a13393497ce7435)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-1 container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-1_openshift-kube-apiserver(b034e27783ace7180a13393497ce7435)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-0 container \"kube-apiserver-check-endpoints\" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-0_openshift-kube-apiserver(c96a5a43a055653004ab702f0ebbae7f)\nStaticPodsDegraded: pod/kube-apiserver-control-plane-0 container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-control-plane-0_openshift-kube-apiserver(c96a5a43a055653004ab702f0ebbae7f)" level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4" level=info msg="Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4" level=error msg="Cluster operator kube-controller-manager 
Degraded is True with NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending; 1 nodes are failing on revision 3:\nNodeInstallerDegraded: " level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 4" level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available" level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-07-18-035613" level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 4.6.0-0.nightly-2020-07-18-035613: openshift-config-managed/kube-cloud-config configmap is required on platform VSphere but not found: configmap \"kube-cloud-config\" not found" ~~~~~ Recent release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6 failures: - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284851991570288640#1:build-log.txt%3A368 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284489511203508224#1:build-log.txt%3A347 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1284126896174403584#1:build-log.txt%3A380 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1283764458539192320#1:build-log.txt%3A377 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.6/1283401794281541632#1:build-log.txt%3A365 Recent release-openshift-ocp-installer-e2e-vsphere-upi-4.6 failures: - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512#1:build-log.txt%3A359 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284489411332935680#1:build-log.txt%3A352 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284126896153432064#1:build-log.txt%3A456 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1283764458514026496#1:build-log.txt%3A374 - https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1283764458514026496#1:build-log.txt%3A374 Version-Release number of selected component (if applicable): 4.6 How reproducible: https://sippy-bparees.svc.ci.openshift.org/?release=4.6 now reports vsphere jobs as 100% failing, and all jobs that even get to this point (10 out of last 14 jobs) fail in this issue. Additional info: - Here are possible duplicates: - https://bugzilla.redhat.com/show_bug.cgi?id=1856425 was filed on azure (not vsphere) and closed as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1856316 which is supposed to be fixed. - CLOSED WONTFIX https://bugzilla.redhat.com/show_bug.cgi?id=1811221 for vsphere in 4.4
Our team isn't able to act as primary support for a platform that doesn't gate control-plane components. Please reassign this to a vSphere SME and have them work with us to resolve it.
picking one failure https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/configmaps.json > https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/configmaps.json ``` { "apiVersion": "v1", "data": { "cloud.conf": "[Global]\nsecret-name = \"vsphere-creds\"\nsecret-namespace = \"kube-system\"\ninsecure-flag = \"1\"\n\n[Workspace]\nserver = \"vcsa-ci.vmware.devcluster.openshift.com\"\ndatacenter = \"dc1\"\ndefault-datastore = \"vsanDatastore\"\nfolder = \"/dc1/vm/ci-op-hksvhgv6-8c73e\"\n\n[VirtualCenter \"vcsa-ci.vmware.devcluster.openshift.com\"]\ndatacenters = \"dc1\"\n" }, "kind": "ConfigMap", "metadata": { "creationTimestamp": "2020-07-19T14:14:31Z", "managedFields": [ { "apiVersion": "v1", "fieldsType": "FieldsV1", "fieldsV1": { "f:data": { ".": {}, "f:cloud.conf": {} } }, "manager": "cluster-config-operator", "operation": "Update", "time": "2020-07-19T14:14:31Z" } ], "name": "kube-cloud-config", "namespace": "openshift-config-managed", "resourceVersion": "3490", "selfLink": "/api/v1/namespaces/openshift-config-managed/configmaps/kube-cloud-config", "uid": "001d9c4f-d2cf-4ef4-a22e-8d430a79b9b2" } }, ``` The `kube-cloud-config` is available and created by kube_cloud_config controller in cluster-config-operator, so it looks like a machine config operator problem wrt accessing the correct object. If you look at the logs for MCO > https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-64cfc94945-hl5c7_machine-config-operator_previous.log ``` I0719 15:00:17.939227 1 operator.go:270] Starting MachineConfigOperator E0719 15:00:17.961963 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 223 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1814240, 0x2a34840) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3 k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82 panic(0x1814240, 0x2a34840) /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166 github.com/openshift/machine-config-operator/pkg/operator.getIgnitionHost(0xc000976188, 0xc000997890, 0x1a41f59, 0x18, 0xc000307758) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:330 +0x314 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc000206d80, 0x0, 0xc038917aaf, 0x29a2ec27ede) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:295 +0x10c7 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc000206d80, 0xc0003dfca8, 0x6, 0x6, 0xc00009c180, 0x6) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:69 +0x177 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc000206d80, 0xc00098eb70, 0x30, 0x0, 0x0) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:362 +0x40a github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc000206d80, 0x203000) 
/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:318 +0xd2 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc000206d80) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:307 +0x2b k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000bdc440) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000bdc440, 0x1cbf1a0, 0xc0007f1f20, 0xc0007da701, 0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bdc440, 0x3b9aca00, 0x0, 0x1b2ca01, 0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2 k8s.io/apimachinery/pkg/util/wait.Until(0xc000bdc440, 0x3b9aca00, 0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:276 +0x3dc panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x168ae94] goroutine 223 [running]: k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105 panic(0x1814240, 0x2a34840) /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/panic.go:969 +0x166 github.com/openshift/machine-config-operator/pkg/operator.getIgnitionHost(0xc000976188, 0xc000997890, 0x1a41f59, 0x18, 0xc000307758) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:330 +0x314 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc000206d80, 0x0, 0xc038917aaf, 0x29a2ec27ede) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:295 +0x10c7 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc000206d80, 0xc0003dfca8, 0x6, 0x6, 0xc00009c180, 0x6) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:69 +0x177 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc000206d80, 0xc00098eb70, 0x30, 0x0, 0x0) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:362 +0x40a github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc000206d80, 0x203000) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:318 +0xd2 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc000206d80) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:307 +0x2b k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000bdc440) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000bdc440, 0x1cbf1a0, 0xc0007f1f20, 0xc0007da701, 0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000bdc440, 0x3b9aca00, 0x0, 0x1b2ca01, 
0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2 k8s.io/apimachinery/pkg/util/wait.Until(0xc000bdc440, 0x3b9aca00, 0xc0000b63c0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:276 +0x3dc ``` you see a panic and then the next pod > https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284851991540928512/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-64cfc94945-hl5c7_machine-config-operator.log ``` I0719 15:05:29.828172 1 start.go:46] Version: 4.6.0-0.nightly-2020-07-19-093912 (Raw: v4.6.0-202007181013.p0-dirty, Hash: fe81baa202b6644dc5521bf229372a282088a445) I0719 15:05:29.830060 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config... ``` Is stuck waiting for lease to start ..
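Stepping back to the panic itself: below is a minimal sketch of the failure mode the stack trace suggests, namely getIgnitionHost dereferencing a platform-specific status pointer that is nil on these clusters. The types and field names are illustrative assumptions for the example only, not the actual MCO code.

```
// Illustrative only: getIgnitionHost at sync.go:330 appears to dereference a
// pointer that is nil here. These types and field names are assumptions made
// for the example, not the real MCO definitions.
package main

import (
	"errors"
	"fmt"
)

type VSpherePlatformStatus struct {
	APIServerInternalIP string
}

type PlatformStatus struct {
	Type    string
	VSphere *VSpherePlatformStatus // may be nil when the installer has no vSphere-specific status to report (e.g. UPI)
}

func getIgnitionHost(status *PlatformStatus) (string, error) {
	// Without these guards, status.VSphere.APIServerInternalIP panics with the
	// same "invalid memory address or nil pointer dereference" seen above.
	if status == nil || status.VSphere == nil || status.VSphere.APIServerInternalIP == "" {
		return "", errors.New("no API server internal IP available for vSphere")
	}
	return status.VSphere.APIServerInternalIP, nil
}

func main() {
	// Mimic a cluster where the vSphere-specific status is absent.
	host, err := getIgnitionHost(&PlatformStatus{Type: "VSphere"})
	fmt.Println(host, err)
}
```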
I don't think the MCO is the source of this error. The main MCO pod shouldn't be doing anything other than waiting for workers. In other jobs both the current and the previous MCO pods panic: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1284126896153432064/artifacts/e2e-vsphere-upi/pods/openshift-machine-config-operator_machine-config-operator-788f669cd8-fdhdx_machine-config-operator.log

So unless we're doing something wrong here https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/operator.go#L276 I think something else is the source of the error. For many of these jobs I can see the following operators degraded: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-config, monitoring. Maybe another component would have more insight? Not sure where to move this, though.
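As a side note on the lease-waiting behavior noted above: the replacement MCO pod blocks in leader election until the crashed pod's lease expires, which is why it logs "attempting to acquire leader lease openshift-machine-config-operator/machine-config..." and then does nothing. A rough, self-contained sketch of that pattern with client-go follows; the MCO's actual lock type, timings, and identity may differ, and the kubeconfig path is a placeholder.

```
// Rough sketch of client-go leader election, illustrating why a freshly
// restarted pod sits idle until the previous holder's lease expires.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-machine-config-operator",
			Name:      "machine-config",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-pod"},
	}

	// RunOrDie blocks until this process acquires the lease, then invokes
	// OnStartedLeading; until then the pod just retries the acquisition.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 90 * time.Second,
		RenewDeadline: 60 * time.Second,
		RetryPeriod:   30 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start the operator's workers */ },
			OnStoppedLeading: func() { /* exit so a new pod can take over */ },
		},
	})
}
```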
*** Bug 1860783 has been marked as a duplicate of this bug. ***
This more recent run from 2020-07-28 ( https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1288115144924073984) which has the fix is still encountering the same error: error: some steps failed: * could not run steps: step e2e-vsphere-upi failed: template pod "e2e-vsphere-upi" failed: the pod ci-op-3npxvqqf/e2e-vsphere-upi failed after 51m11s (failed containers: setup): ContainerFailed one or more containers exited Container setup exited with code 1, reason Error --- ed is False with AsExpected: " level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available" level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-07-25-091217" level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 4.6.0-0.nightly-2020-07-25-091217: openshift-config-managed/kube-cloud-config configmap is required on platform VSphere but not found: configmap \"kube-cloud-config\" not found" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-07-25-091217" level=info msg="Cluster operator monitoring Available is False with : " level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
Issue still exist on upi on vsphere with https proxy, nightly build is 4.6.0-0.nightly-2020-08-04-210224 ~~~~~~~~~~~~ level=error msg="Cluster operator authentication Degraded is True with APIServerDeployment_UnavailablePod::IngressStateEndpoints_MissingSubsets::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.126.218:443/healthz\": dial tcp 172.30.126.218:443: connect: connection timed out\nRouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret \"v4-0-config-system-router-certs\" not found\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nAPIServerDeploymentDegraded: 3 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver" level=info msg="Cluster operator authentication Available is False with APIServerDeployment_NoPod: APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node." level=error msg="Cluster operator kube-apiserver Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending\nStaticPodsDegraded: pods \"kube-apiserver-control-plane-0\" not found" level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 3" level=error msg="Cluster operator kube-controller-manager Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: StaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 2m40s restarting failed container=cluster-policy-controller pod=kube-controller-manager-control-plane-0_openshift-kube-controller-manager(f8ae47a0c6efdb3244d1bc2e3725812e)\nStaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=cluster-policy-controller pod=kube-controller-manager-control-plane-0_openshift-kube-controller-manager(f8ae47a0c6efdb3244d1bc2e3725812e)\nStaticPodsDegraded: pod/kube-controller-manager-control-plane-0 container \"kube-controller-manager\" is not ready: unknown reason\nNodeInstallerDegraded: 1 nodes are failing on revision 5:\nNodeInstallerDegraded: ; 1 nodes are failing on revision 6:\nNodeInstallerDegraded: static pod of revision 6 has been installed, but is not ready while new revision 7 is pending" level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 7" level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available" level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-sdn/sdn-metrics\" rollout is not making progress - last change 2020-08-05T01:27:36Z" level=info msg="Cluster operator network Progressing is True with Deploying: 
DaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes" level=info msg="Cluster operator network Available is False with Startup: The network is starting up" level=info msg="Cluster operator openshift-apiserver Available is False with APIServices_PreconditionNotReady: APIServicesAvailable: PreconditionNotReady" level=info msg="Cluster operator openshift-controller-manager Progressing is True with : " level=info msg="Cluster operator openshift-controller-manager Available is False with : " level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : " level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.16.0" ~~~~~~~~~~~~~~~
Looking through the CI run logs, I no longer see the panic in the MCO, but installs are still failing across several operators; moving to auth since it is among the failing operators. For example:

level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.147.209:443/healthz\": dial tcp 172.30.147.209:443: connect: connection timed out\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address\nOAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps \"oauth-openshift\" not found\nRouteDegraded: Route is not available at canonical host oauth-openshift.apps.ci-op-2fnvilpp-8c73e.origin-ci-int-aws.dev.rhcloud.com: route status ingress is empty\nOAuthServerRouteDegraded: Route is not available at canonical host oauth-openshift.apps.ci-op-2fnvilpp-8c73e.origin-ci-int-aws.dev.rhcloud.com: route status ingress is empty"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring, network"
According to the artifacts for a recent UPI CI job [1]:
- The auth operator is degraded because its route has not been processed by ingress
- The ingress operator is unable to fully deploy because there are no worker nodes to schedule to

Assigning to the installer team to figure out why the worker nodes are not being created properly as part of the install.

1: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.6/1292825401403379712
A possible direction for debugging why the compute nodes never joined would be to gather their console logs in the old-style template. Example AWS implementation: [1].

[1]: https://github.com/openshift/release/blob/bde0ab9f545d5197338a248ac24c36a991af48fb/ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml#L2381-L2394
This might be due to using an oc v4.5 binary on a 4.6 cluster, causing certificate approvals to fail. The prow job linked in Comment 11 is missing a few lines in the output, and we are seeing this elsewhere. In the last successful run of this job we saw the following lines:

certificatesigningrequest.certificates.k8s.io/csr-6ptpf approved
certificatesigningrequest.certificates.k8s.io/csr-fbtgm approved
certificatesigningrequest.certificates.k8s.io/csr-kl86x approved
certificatesigningrequest.certificates.k8s.io/csr-kzc7v approved
certificatesigningrequest.certificates.k8s.io/csr-hfmjt approved
certificatesigningrequest.certificates.k8s.io/csr-psn6g approved
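For context on what those lines correspond to: approving a CSR adds an Approved condition through the CSR approval subresource. Below is a minimal client-go sketch of that operation, roughly what the CI step's `oc adm certificate approve` invocation amounts to; it is not the CI template's actual implementation, it assumes the certificates.k8s.io/v1 API, and the kubeconfig path and reason string are placeholders.

```
// Minimal sketch: approve all pending CSRs in the cluster.
package main

import (
	"context"
	"fmt"

	certificatesv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// pending reports whether a CSR has neither been approved nor denied yet.
func pending(csr certificatesv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certificatesv1.CertificateApproved || c.Type == certificatesv1.CertificateDenied {
			return false
		}
	}
	return true
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range csrs.Items {
		csr := csrs.Items[i]
		if !pending(csr) {
			continue
		}
		csr.Status.Conditions = append(csr.Status.Conditions, certificatesv1.CertificateSigningRequestCondition{
			Type:    certificatesv1.CertificateApproved,
			Status:  corev1.ConditionTrue,
			Reason:  "CIApproval", // placeholder reason
			Message: "approved for CI node bootstrap",
		})
		// Approval must go through the dedicated approval subresource.
		if _, err := client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, csr.Name, &csr, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("certificatesigningrequest.certificates.k8s.io/%s approved\n", csr.Name)
	}
}
```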
After using the latest oc version with 4.6, the issue is no longer reproduced when installing OCP 4.6 UPI on vSphere.
Tried to perform a fresh install on profile ipi-on-vsphere/versioned-installer-6_7-http_proxy-vsphere_slave-ci using payload 4.6.0-0.nightly-2020-08-11-040013 but hit the issue below. level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nOAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.62.70:443/healthz\": dial tcp 172.30.62.70:443: connect: connection timed out\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nOAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address\nRouteDegraded: Route is not available at canonical host oauth-openshift.apps.knarra-08debug1.qe.devcluster.openshift.com: route status ingress is empty\nOAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps \"oauth-openshift\" not found\nOAuthServerRouteDegraded: Route is not available at canonical host oauth-openshift.apps.knarra-08debug1.qe.devcluster.openshift.com: route status ingress is empty" level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []" level=info msg="Cluster operator console Available is Unknown with NoData: " level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available." level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available." level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default" level=info msg="Cluster operator insights Disabled is False with AsExpected: " level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available" level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack." level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. 
Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available" level=info msg="Cluster operator monitoring Available is False with : " level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-sdn/sdn-metrics\" rollout is not making progress - last change 2020-08-11T16:29:30Z" level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-sdn/sdn-metrics\" is not yet scheduled on any nodes" level=info msg="Cluster operator network Available is False with Startup: The network is starting up" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring, network"

On further checking, Jinyun Ma said the CSRs are in a pending state; for IPI this should be performed automatically through the installer, which has not happened here. We have found two similar installs hitting the same issue. The link below contains the must-gather logs:
===========================================
http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/
The CSRs need to be approved by cluster-machine-approver. Looking at the logs in http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-cluster-machine-approver/pods/machine-approver-5f6457695b-q8t77/machine-approver-controller/machine-approver-controller/logs/current.log there are a bunch of these:

```
2020-08-12T06:09:40.916502422Z I0812 06:09:40.914467 1 main.go:147] CSR csr-bn8j8 added
2020-08-12T06:09:40.945201911Z I0812 06:09:40.945119 1 main.go:182] CSR csr-bn8j8 not authorized: failed to find machine for node knarra-08debug1-8dlwg-worker-gp8wk
```

However, I can see the machine object at http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/machine.openshift.io/machines/ http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/machine.openshift.io/machines/knarra-08debug1-8dlwg-worker-gp8wk.yaml

So there is a problem that is causing the approver to reject the CSR. Looking at the machine-api controller > http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1859230/must-gather.local.3578202702302146318/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a84ccbdd140bb7151a774df04b6ba8a310ab4bbf025405a9af18b5e63847912/namespaces/openshift-machine-api/pods/machine-api-controllers-7f468664b5-zhx4b/machine-controller/machine-controller/logs/current.log

```
2020-08-11T16:42:40.557574474Z I0811 16:42:40.557518 1 machine_scope.go:102] knarra-08debug1-8dlwg-worker-gp8wk: patching machine
2020-08-11T16:42:40.583703973Z E0811 16:42:40.583648 1 machine_scope.go:114] Failed to patch machine "knarra-08debug1-8dlwg-worker-gp8wk": admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120)
2020-08-11T16:42:40.583777731Z E0811 16:42:40.583750 1 controller.go:279] knarra-08debug1-8dlwg-worker-gp8wk: error updating machine: admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120)
```

I think this is causing the controller to fail to update the status of the vSphere machines, which means the approver doesn't have the information it needs to approve the CSR. Since this bug mostly tracks vSphere UPI failures, I would recommend creating another bug for this IPI failure and assigning it to the cloud team.
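To make that causal chain concrete: the approver can only authorize a node's CSR if it can map the node name back to a Machine, and that mapping comes from status data the machine controller was unable to patch. Below is a simplified, self-contained sketch of that matching step; the field names are made up for illustration and this is not the actual cluster-machine-approver code.

```
// Simplified illustration of the node-to-Machine lookup the approver depends on.
package main

import "fmt"

type Machine struct {
	Name            string
	StatusAddresses []string // filled in by the machine controller when it patches status
}

// findMachineForNode looks for a Machine whose status addresses include the
// node name from the CSR; without populated status, nothing can match.
func findMachineForNode(machines []Machine, nodeName string) (*Machine, bool) {
	for i := range machines {
		for _, addr := range machines[i].StatusAddresses {
			if addr == nodeName {
				return &machines[i], true
			}
		}
	}
	return nil, false
}

func main() {
	// The Machine object exists, but its status was never patched because the
	// validation webhook rejected the update (diskGiB below minimum), so the
	// lookup fails and the CSR is reported as "not authorized".
	machines := []Machine{{Name: "knarra-08debug1-8dlwg-worker-gp8wk"}}
	if _, ok := findMachineForNode(machines, "knarra-08debug1-8dlwg-worker-gp8wk"); !ok {
		fmt.Println("failed to find machine for node knarra-08debug1-8dlwg-worker-gp8wk")
	}
}
```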
So bug https://bugzilla.redhat.com/show_bug.cgi?id=1868448 is the main reason why the IPI workers are not joining, and the UPI worker CSRs have already been fixed by https://github.com/openshift/release/pull/10950. So closing this bug. Please feel free to create a new bug if a different reason comes up.