Bug 1908493
Summary: 4.7-e2e-metal-ipi-ovn-dualstack intermittent test failures, worker hostname is overwritten by NM
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on Bare Metal IPI
Reporter: Bob Fournier <bfournie>
Assignee: Derek Higgins <derekh>
QA Contact: Shelly Miron <smiron>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: derekh, kholtz, kquinn, rbartal, smiron, stbenjam
Version: 4.7
Keywords: Triaged, UpcomingSprint
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text: Previously, in dual-stack deployments, a worker node's hostname sometimes did not match the name recorded during pre-deployment inspection, which caused the node's certificate signing requests to require manual approval. This issue has been fixed.
Story Points: ---
Cloned As: 1955114 (view as bug list)
Last Closed: 2021-02-24 15:45:31 UTC
Type: Bug
Regression: ---
Bug Blocks: 1955114
Attachments: openshift_install.log (attachment 1748659)
Description
Bob Fournier
2020-12-16 20:56:05 UTC
From the job history at https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack, this failure is intermittent over the last few days. It has occurred since at least 12/11 (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1337544938430140416) and has been intermixed with successful runs. There is currently no must-gather included in the results; Stephen has just added a patch to re-enable it - https://github.com/openshift-metal3/dev-scripts/pull/1173.

The latest dualstack test just passed, so this is definitely intermittent.

It looks like the same root cause as before: the kubelet is requesting a certificate with the FQDN, but the Machine doesn't have it, so the CSR isn't signed.

./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.330897225Z I1217 02:04:30.330758 1 main.go:147] CSR csr-wvvjr added
./namespaces/openshift-cluster-machine-approver/pods/machine-approver-7dfc9559f7-k869q/machine-approver-controller/machine-approver-controller/logs/current.log:2020-12-17T02:04:30.345437663Z I1217 02:04:30.345387 1 main.go:182] CSR csr-wvvjr not authorized: failed to find machine for node worker-0.ostest.test.metalkube.org

must-gather is here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-metal3_dev-scripts/1173/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-ovn-dualstack/1339362989836341248/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/

@derekh, any chance you have time to look?

Looks like we have a race condition: our script sets the hostname, but if things take long enough, NetworkManager eventually sets it back. I've extracted the relevant logs; here NetworkManager sets the hostname back to worker-0 before the details are posted to ironic:

Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info> [1608167321.5491] policy: set-hostname: current hostname was changed outside NetworkManager: 'worker-0.ostest.test.metalkube.org'
Dec 16 20:08:41 worker-0.ostest.test.metalkube.org NetworkManager[458]: <info> [1608167321.5492] policy: set-hostname: set hostname to 'worker-0' (from DHCPv4)
...
Dec 16 20:09:53 worker-0 ironic-python-agent[832]: 2020-12-16 20:09:53.711 832 INFO ironic_python_agent.inspector [-] posting collected data to http://[fd00:1101::3]:5050/v1/continue

I suspect this is probably more likely to be hit now that inspector is running all of its collectors and taking longer, since we fixed the naming in https://github.com/openshift/ironic-image/pull/131. I'll look into a better option to set the hostname.
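For reference, two general ways to stop a DHCP-triggered NetworkManager hostname revert like the one in the logs above - this is only a sketch of the approach, not necessarily the fix that landed for this bug, and the drop-in file name is arbitrary:

    # Hypothetical sketch, not the merged fix.
    # Option 1: set the desired name as the *static* hostname; NetworkManager
    # only applies a DHCP-provided name while the static hostname is unset.
    hostnamectl set-hostname worker-0.ostest.test.metalkube.org

    # Option 2: tell NetworkManager not to manage the hostname at all
    # (hostname-mode is supported in NetworkManager 1.12+).
    cat > /etc/NetworkManager/conf.d/90-hostname.conf <<'EOF'
    [main]
    hostname-mode=none
    EOF
    systemctl restart NetworkManager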
openshift_install.log
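For anyone triaging a similar "masters Ready, workers missing" state, one quick way to check whether it is the CSR symptom described earlier in this bug (generic commands; the CSR name below is a placeholder):

    # Kubelet CSRs stuck in Pending are the symptom from this bug.
    oc get csr

    # Check the machine-approver for "failed to find machine for node ..." messages.
    oc logs -n openshift-cluster-machine-approver deploy/machine-approver -c machine-approver-controller

    # Manual workaround only: approve a pending node CSR by hand.
    oc adm certificate approve <csr-name>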
(In reply to Shelly Miron from comment #8)
> I also ran 2 dual-stack 4.7 stable deployments - both failed
> (quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64,
> quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64)
>
> The worker nodes were not deployed:
>
> (From 4.7.0-fc.3 image):
>
> [kni@provisionhost-0-0 ~]$ oc get node
> NAME                                              STATUS   ROLES    AGE   VERSION
> master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   34m   v1.20.0+394a5a3
> master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   33m   v1.20.0+394a5a3
> master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   32m   v1.20.0+394a5a3

It doesn't look like this is the same problem. Do you have a must-gather from one of the failed runs?

After several attempts to reproduce the bug, it seems the problem has been fixed: with the quay images quay.io/openshift-release-dev/ocp-release:4.7.0-fc.3-x86_64 and quay.io/openshift-release-dev/ocp-release:4.7.0-fc.2-x86_64, dual-stack deployed successfully. Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633