The 4.7 periodic test e2e-metal-ipi-ovn-dualstack has failed the last two times; the latest run is here: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1339248412003930112

No workers are created:

  Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods)

Here are the logged errors:

level=debug msg=Still waiting for the cluster to initialize: Cluster operator network is reporting a failure: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
level=error msg=Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_DeploymentAvailableReplicasCheckFailed::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost::WellKnownReadyController_SyncError: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
level=error msg=OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.42.244:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
level=error msg=IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
level=error msg=RouteDegraded: no ingress for host oauth-openshift.apps.ostest.test.metalkube.org in route oauth-openshift in namespace openshift-authentication
level=error msg=OAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address
level=error msg=OAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps "oauth-openshift" not found
level=error msg=WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
level=error msg=OAuthServerDeploymentDegraded: deployments.apps "oauth-openshift" not found
level=error msg=OAuthServerRouteDegraded: no ingress for host oauth-openshift.apps.ostest.test.metalkube.org in route oauth-openshift in namespace openshift-authentication
level=info msg=Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionDeployment_MissingDeployment::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints
level=info msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
level=info msg=OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.42.244:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
level=info msg=WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
level=info msg=Cluster operator baremetal Disabled is False with :
level=info msg=Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route "console" is not available at canonical host []
level=info msg=OAuthClientSyncProgressing: route "console" is not available at canonical host []
level=info msg=Cluster operator console Available is Unknown with NoData:
level=info msg=Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
level=info msg=Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
level=error msg=Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-559ff48fd-xrf4q" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-559ff48fd-9wz2m" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)
level=info msg=Cluster operator insights Disabled is True with Disabled: Health reporting is disabled
level=info msg=Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
level=info msg=Cluster operator monitoring Available is False with :
level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
level=error msg=Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
level=error msg=Cluster operator network Degraded is True with RolloutHung: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
level=info msg=Cluster operator network ManagementStateDegraded is False with :
level=info msg=Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
level=info msg=Cluster operator network Available is False with Startup: The network is starting up
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator network is reporting a failure: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
From the job history at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack, this failure is intermittent over the last few days. It has occurred since at least 12/11 (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1337544938430140416) and has been intermixed with successful runs. There is currently no must-gather included in the results; Stephen has just added a patch to re-enable it: https://github.com/openshift-metal3/dev-scripts/pull/1173.
The latest dual-stack test just passed, so this is definitely intermittent.
It looks like the same root cause as before: kubelet is requesting a certificate for its FQDN, but the Machine object doesn't have it, so the CSR isn't signed.

./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.330897225Z I1217 02:04:30.330758 1 main.go:147] CSR csr-wvvjr added
./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.345437663Z I1217 02:04:30.345387 1 main.go:182] CSR csr-wvvjr not authorized: failed to find machine for node worker-0.ostest.test.metalkube.org

must-gather is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-metal3_dev-scripts/1173/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-ovn-dualstack/1339362989836341248/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/

@derekh, any chance you have time to look?
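For anyone hitting this on a live cluster: the stuck CSRs can be inspected and, if you trust the requesting nodes, approved by hand while the root cause is worked on. This is a diagnostic/workaround sketch, not the fix; it assumes a working kubeconfig with cluster-admin, and `csr-wvvjr` is just the example name from the log above.

```shell
# Show all CSRs; ones the machine-approver rejected stay in Pending
oc get csr -o wide

# Inspect a specific pending CSR (e.g. the one from the log above)
oc describe csr csr-wvvjr

# Manually approve every pending CSR.
# Only do this if you trust the nodes making the requests.
oc get csr -o name | xargs oc adm certificate approve
```

Note that this only papers over the symptom: until the hostname/Machine mismatch is fixed, each subsequent CSR (serving certs included) may need the same manual approval.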
Looks like we have a race condition: our script sets the hostname, but if things take long enough NetworkManager sets it back. I've extracted the relevant logs; here NetworkManager sets the hostname back to worker-0 before the details are posted to Ironic:

Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info> [1608167321.5491] policy: set-hostname: current hostname was changed outside NetworkManager: 'worker-0.ostest.test.metalkube.org'
Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info> [1608167321.5492] policy: set-hostname: set hostname to 'worker-0' (from DHCPv4)
...
Dec 16 20:09:53 worker-0 ironic-python-agent[832]: 2020-12-16 20:09:53.711 832 INFO ironic_python_agent.inspector [-] posting collected data to http://[fd00:1101::3]:5050/v1/continue

I suspect this is more likely to be hit now that inspector is running all of its collectors and taking longer, since we fixed the naming in https://github.com/openshift/ironic-image/pull/131. I'll look into a better option for setting the hostname.
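One possible mitigation (a sketch of an option, not the fix that was actually chosen) would be to stop NetworkManager from managing the hostname at all, so a later DHCPv4 lease can no longer overwrite the value our script sets. NetworkManager supports this via a `hostname-mode` key in a configuration drop-in; the file name below is hypothetical.

```ini
# /etc/NetworkManager/conf.d/90-keep-hostname.conf
# Tell NetworkManager never to set the transient hostname itself,
# so the FQDN set by our script survives DHCP lease renewals.
[main]
hostname-mode=none
```

The trade-off is that nodes relying on DHCP-provided hostnames would no longer pick them up, so this would have to be applied only on images where the script is guaranteed to set the hostname first.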
I ran 3 dual-stack deployments with 4.7 nightly images - all passed (for instance: registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-17-211555).

I also ran 2 dual-stack 4.7 stable deployments - both failed (quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64, quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64). Worker nodes were not deployed (from the 4.7.0-fc.3 image):

[kni@provisionhost-0-0 ~]$ oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   34m   v1.20.0+394a5a3
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   33m   v1.20.0+394a5a3
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   32m   v1.20.0+394a5a3

and multiple errors occurred:

Message: Multiple errors are preventing progress:
* Could not update prometheusrule "openshift-cloud-credential-operator/cloud-credential-operator-alerts" (559 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-machine-approver/machineapprover-rules" (578 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-node-tuning-operator/node-tuning-operator" (339 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-samples-operator/samples-operator-alerts" (360 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-dns-operator/dns" (603 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-image-registry/image-registry-operator-alerts" (288 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-ingress-operator/ingress-operator" (610 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-apiserver-operator/kube-apiserver-operator" (614 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (626 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-scheduler-operator/kube-scheduler-operator" (630 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-api/cluster-autoscaler-operator-rules" (251 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-api/machine-api-operator-prometheus-rules" (638 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-config-operator/machine-config-daemon" (640 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-operator-lifecycle-manager/olm-alert-rules" (645 of 662): the server is reporting an internal error

.openshift_install.log attached to the bug (from the 4.7.0-fc.2 image).
Created attachment 1748659 [details] openshihft_install.log
(In reply to Shelly Miron from comment #8)
> I also ran 2 dual stack 4.7 stable deployments - both failed
> (quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64,
> quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64)
>
> worker nodes did not deployed:
>
> (From 4.7.0-fc.3 image):
>
> [kni@provisionhost-0-0 ~]$ oc get node
> NAME                                              STATUS   ROLES    AGE   VERSION
> master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   34m   v1.20.0+394a5a3
> master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   33m   v1.20.0+394a5a3
> master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   32m   v1.20.0+394a5a3

It doesn't look like this is the same problem; do you have a must-gather from one of the failed runs?
After several attempts to reproduce the bug, it seems the problem has been fixed: with the quay images quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64 and quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64, dual-stack deployed successfully. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633