Bug 1908493

Summary: 4.7-e2e-metal-ipi-ovn-dualstack intermittent test failures, worker hostname is overwritten by NM
Product: OpenShift Container Platform
Reporter: Bob Fournier <bfournie>
Component: Installer
Assignee: Derek Higgins <derekh>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Shelly Miron <smiron>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: derekh, kholtz, kquinn, rbartal, smiron, stbenjam
Version: 4.7
Keywords: Triaged, UpcomingSprint
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when using dual-stack deployments, a worker node's hostname could be overwritten by NetworkManager so that it no longer matched the name recorded during pre-deployment inspection, and the node's certificate signing requests then required manual approval. This has been fixed.
Story Points: ---
Clone Of:
: 1955114 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:45:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1955114
Attachments:
openshihft_install.log (flags: none)

Description Bob Fournier 2020-12-16 20:56:05 UTC
The 4.7 periodic test e2e-metal-ipi-ovn-dualstack has failed the last two times; the latest run is here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1339248412003930112

No workers are created:
Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods)

Here are the logged errors:
level=debug msg=Still waiting for the cluster to initialize: Cluster operator network is reporting a failure: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
level=error msg=Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_DeploymentAvailableReplicasCheckFailed::OAuthServerRoute_InvalidCanonicalHost::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::OAuthVersionDeployment_GetFailed::Route_InvalidCanonicalHost::WellKnownReadyController_SyncError: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
level=error msg=OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.42.244:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
level=error msg=IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
level=error msg=RouteDegraded: no ingress for host oauth-openshift.apps.ostest.test.metalkube.org in route oauth-openshift in namespace openshift-authentication
level=error msg=OAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address
level=error msg=OAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps "oauth-openshift" not found
level=error msg=WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
level=error msg=OAuthServerDeploymentDegraded: deployments.apps "oauth-openshift" not found
level=error msg=OAuthServerRouteDegraded: no ingress for host oauth-openshift.apps.ostest.test.metalkube.org in route oauth-openshift in namespace openshift-authentication
level=info msg=Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionDeployment_MissingDeployment::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints
level=info msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
level=info msg=OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.42.244:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
level=info msg=WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
level=info msg=Cluster operator baremetal Disabled is False with : 
level=info msg=Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route "console" is not available at canonical host []
level=info msg=OAuthClientSyncProgressing: route "console" is not available at canonical host []
level=info msg=Cluster operator console Available is Unknown with NoData: 
level=info msg=Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available.
level=info msg=Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
level=error msg=Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-559ff48fd-xrf4q" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Pod "router-default-559ff48fd-9wz2m" cannot be scheduled: 0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)
level=info msg=Cluster operator insights Disabled is True with Disabled: Health reporting is disabled
level=info msg=Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
level=info msg=Cluster operator monitoring Available is False with : 
level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
level=error msg=Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
level=error msg=Cluster operator network Degraded is True with RolloutHung: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=info msg=Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is not available (awaiting 1 nodes)
level=info msg=Cluster operator network Available is False with Startup: The network is starting up
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator network is reporting a failure: Deployment "openshift-network-diagnostics/network-check-source" rollout is not making progress - last change 2020-12-16T17:29:22Z
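
For anyone triaging a similar run, the gap between provisioned hosts and registered nodes is usually visible with a few standard commands (illustrative only, not taken from this job's artifacts):

  oc get nodes -o wide
  oc get machines -n openshift-machine-api
  oc get baremetalhosts -n openshift-machine-api
  oc get csr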

Comment 1 Bob Fournier 2020-12-16 22:00:59 UTC
From the job history at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack, this failure has been intermittent over the last few days.  It has occurred since at least 12/11 (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1337544938430140416) and has been intermixed with successful runs.

There is currently no must-gather included in the results; Stephen has just added a patch to re-enable it - https://github.com/openshift-metal3/dev-scripts/pull/1173.

Comment 2 Bob Fournier 2020-12-17 01:57:38 UTC
The latest dualstack test just passed, so this is definitely intermittent.

Comment 3 Stephen Benjamin 2020-12-17 02:45:18 UTC
It looks like the same root cause as before: kubelet is requesting a certificate for the FQDN, but the machine doesn't have it, so the CSR isn't signed.

./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.330897225Z I1217 02:04:30.330758       1 main.go:147] CSR csr-wvvjr added
./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.345437663Z I1217 02:04:30.345387       1 main.go:182] CSR csr-wvvjr not authorized: failed to find machine for node worker-0.ostest.test.metalkube.org
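
For triage, the pending CSRs and the node names they request can be listed, and a request that the machine-approver rejected only because of the hostname mismatch can be approved by hand (illustrative commands; csr-wvvjr is the request from the log above):

  oc get csr -o wide
  oc adm certificate approve csr-wvvjr

This only works around the symptom; the underlying issue is the hostname changing after inspection.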


must-gather is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-metal3_dev-scripts/1173/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-ovn-dualstack/1339362989836341248/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/

@derekh, any chance you have time to look?

Comment 4 Derek Higgins 2020-12-18 13:05:10 UTC
Looks like we have a race condition: our script sets the hostname, but if things take long enough, NetworkManager eventually sets it back.

I've extracted the relevant logs; here NetworkManager sets the hostname back to 'worker-0' before the inspection details are posted to ironic:

Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info>  [1608167321.5491] policy: set-hostname: current hostname was changed outside NetworkManager: 'worker-0.ostest.test.metalkube.org'
Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info>  [1608167321.5492] policy: set-hostname: set hostname to 'worker-0' (from DHCPv4)
...
Dec 16 20:09:53 worker-0 ironic-python-agent[832]: 2020-12-16 20:09:53.711 832 INFO ironic_python_agent.inspector [-] posting collected data to http://[fd00:1101::3]:5050/v1/continue

I suspect this is more likely to be hit now that inspector is running all of its collectors and taking longer,
since we fixed the naming in https://github.com/openshift/ironic-image/pull/131.

I'll look into a better option for setting the hostname.
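
For context (and not necessarily the change that lands): a transient hostname set with the `hostname` command can be replaced by NetworkManager's DHCP hostname policy, whereas a static hostname, or disabling NetworkManager's hostname management entirely, should survive it. A minimal sketch of the two usual options:

  # set a static hostname instead of a transient one
  hostnamectl set-hostname worker-0.ostest.test.metalkube.org

  # or tell NetworkManager not to manage the hostname at all
  cat > /etc/NetworkManager/conf.d/90-hostname.conf <<'EOF'
  [main]
  hostname-mode=none
  EOF
  systemctl reload NetworkManager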

Comment 8 Shelly Miron 2021-01-19 09:08:31 UTC
I ran 3 dual-stack deployments with 4.7 nightly images - all passed (for instance: registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-17-211555).
I also ran 2 dual-stack 4.7 stable deployments - both failed (quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64, quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64).

The worker nodes were not deployed:


(From 4.7.0-fc.3 image):


[kni@provisionhost-0-0 ~]$ oc get node
NAME                                              STATUS   ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   34m   v1.20.0+394a5a3
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   33m   v1.20.0+394a5a3
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   32m   v1.20.0+394a5a3

and multiple errors occurred:

Message:               Multiple errors are preventing progress:
* Could not update prometheusrule "openshift-cloud-credential-operator/cloud-credential-operator-alerts" (559 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-machine-approver/machineapprover-rules" (578 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-node-tuning-operator/node-tuning-operator" (339 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-samples-operator/samples-operator-alerts" (360 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-dns-operator/dns" (603 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-image-registry/image-registry-operator-alerts" (288 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-ingress-operator/ingress-operator" (610 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-apiserver-operator/kube-apiserver-operator" (614 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (626 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-kube-scheduler-operator/kube-scheduler-operator" (630 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-api/cluster-autoscaler-operator-rules" (251 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-api/machine-api-operator-prometheus-rules" (638 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-machine-config-operator/machine-config-daemon" (640 of 662): the server is reporting an internal error
* Could not update prometheusrule "openshift-operator-lifecycle-manager/olm-alert-rules" (645 of 662): the server is reporting an internal error


The .openshift_install.log is attached to the bug (from the 4.7.0-fc.2 image).

Comment 9 Shelly Miron 2021-01-19 09:11:07 UTC
Created attachment 1748659 [details]
openshihft_install.log

Comment 10 Derek Higgins 2021-01-19 11:20:09 UTC
(In reply to Shelly Miron from comment #8)
> I also ran 2 dual stack 4.7 stable deployments - both failed
> (quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64,
> quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64)
> 
> worker nodes did not deployed:
> 
> 
> (From 4.7.0-fc.3 image):
> 
> 
> [kni@provisionhost-0-0 ~]$ oc get node
> NAME                                              STATUS   ROLES    AGE  
> VERSION
> master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   34m  
> v1.20.0+394a5a3
> master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   33m  
> v1.20.0+394a5a3
> master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   32m  
> v1.20.0+394a5a3

It doesn't look like this is the same problem. Do you have a must-gather from one of the failed runs?

Comment 11 Shelly Miron 2021-01-19 13:46:47 UTC
After several attempts to reproduce the bug, it seems the problem has been fixed -
with the quay images quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64 and quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64, dual-stack deployed successfully.

Verified.

Comment 14 errata-xmlrpc 2021-02-24 15:45:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633