Bug 2034527
| Summary: | IPI deployment fails 'timeout reached while inspecting the node' when provisioning network ipv6 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lubov <lshilin> |
| Component: | Installer | Assignee: | Derek Higgins <derekh> |
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Lubov <lshilin> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | urgent | CC: | achernet, aguclu, jniu, rpittau, stbenjam, vvoronko, yjoseph, ykashtan, zbitter |
| Version: | 4.10 | Keywords: | AutomationBlocker, Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | job=periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-ovn-dualstack=all |
| Last Closed: | 2022-03-10 16:35:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Could you log into the machine's virtual console to see what is happening there?

Okay, got it. So we're seeing a failure to connect to the rootfs location. Now I wonder whether it's a DHCP issue or a routing issue. The address it is trying to use: do you expect it to be reachable from the machine?

@vvoronko, can you help with the DHCP question, please?

Reproduced for 4.10.0-0.nightly-2021-12-21-130047 (bare metal, ipv4, provisioning ipv6): https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1191/index.html

*** Bug 2035219 has been marked as a duplicate of this bug. ***

*** Bug 2037419 has been marked as a duplicate of this bug. ***

As far as I can see, `ip=dhcp` is being included in the kernel params. This is what we had in dual-stack pre METAL-1, but the provisioning network is ipv6, so RHCOS doesn't boot and can't inspect. I think pre METAL-1 this was fine because `ip=dhcp` wasn't being set on the IPA image; now it is (as it's now using RHCOS), so the node attempts to boot with `ip=dhcp` and then can't download the rootfs at http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs

This is breaking dual-stack, even in CI, so it is definitely a release blocker.

The cluster is now provisioning in CI but some of the tests are failing; new BZ opened: https://bugzilla.redhat.com/show_bug.cgi?id=2040671

Verified on 4.10.0-0.nightly-2022-01-15-092722.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
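The root cause described in the comments (an IPv4-only `ip=dhcp` kernel argument handed to the IPA ramdisk on an IPv6 provisioning network) can be sketched as a small selection function. This is only an illustration of the logic, not the actual fix (the installer and ironic images are written in Go); the helper name is hypothetical, and `ip=dhcp6` is the dracut argument for requesting a DHCPv6 lease:

```python
import ipaddress

def kernel_ip_arg(provisioning_cidr: str) -> str:
    """Pick the dracut ip= kernel argument for the IPA ramdisk.

    ip=dhcp only requests an IPv4 lease; on an IPv6-only provisioning
    network the ramdisk then has no route to the rootfs URL, which is
    the failure mode in this bug. (Hypothetical helper, for illustration.)
    """
    net = ipaddress.ip_network(provisioning_cidr, strict=False)
    return "ip=dhcp6" if net.version == 6 else "ip=dhcp"

# The provisioning network from this report is IPv6, so the ramdisk
# needs ip=dhcp6 rather than ip=dhcp:
kernel_ip_arg("fd00:1101::/64")   # IPv6 provisioning network
kernel_ip_arg("172.22.0.0/24")    # IPv4 provisioning network
```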
Version:

```
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.10.0-0.nightly-2021-12-20-231053
built from commit 9f37ece3620d14e48507f2afc5cf6a667ca2cef0
release image registry.ci.openshift.org/ocp/release@sha256:26f811bd37c593564093aa8e323cf81637c49a9527dbe6772c885fa9f55ab684
release architecture amd64
```

Platform: IPI

What happened?

Deploy on real BM failed twice in a row:

```
time="2021-12-21T08:18:15+02:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg=" on ../../tmp/openshift-install-masters-2606240572/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2021-12-21T08:18:15+02:00" level=error msg=" 13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
```

(the same error block repeats for each of the three master hosts)

```
time="2021-12-21T08:18:15+02:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2021-12-21T08:18:16+02:00" level=debug msg="OpenShift Installer 4.10.0-0.nightly-2021-12-20-231053"
time="2021-12-21T08:18:16+02:00" level=debug msg="Built from commit 9f37ece3620d14e48507f2afc5cf6a667ca2cef0"
time="2021-12-21T08:18:16+02:00" level=info msg="Waiting up to 20m0s (until 8:38AM) for the Kubernetes API at https://api.ocp-edge.lab.eng.tlv2.redhat.com:6443..."
time="2021-12-21T08:18:16+02:00" level=info msg="API v1.22.1+6859754 up"
time="2021-12-21T08:18:16+02:00" level=info msg="Waiting up to 30m0s (until 8:48AM) for bootstrapping to complete..."
time="2021-12-21T08:48:16+02:00" level=info msg="Use the following commands to gather logs from the cluster"
time="2021-12-21T08:48:16+02:00" level=info msg="openshift-install gather bootstrap --help"
time="2021-12-21T08:48:16+02:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2021-12-21T08:48:16+02:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
```

The status of the cluster:

```
[kni@ocp-edge06 ~]$ oc get nodes
No resources found

[kni@ocp-edge06 ~]$ oc get bmh -A
NAMESPACE               NAME                 STATE   CONSUMER                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0           ocp-edge-6xs6f-master-0   true             120m
openshift-machine-api   openshift-master-1           ocp-edge-6xs6f-master-1   true             120m
openshift-machine-api   openshift-master-2           ocp-edge-6xs6f-master-2   true             120m
openshift-machine-api   openshift-worker-0                                     true             120m
openshift-machine-api   openshift-worker-1                                     true             120m
openshift-machine-api   openshift-worker-2                                     true             120m

[kni@ocp-edge06 ~]$ oc get machineset -A
NAMESPACE               NAME                      DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp-edge-6xs6f-worker-0   3         0                             121m

[kni@ocp-edge06 ~]$ oc get machines -A
NAMESPACE               NAME                      PHASE   TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-6xs6f-master-0                                  121m
openshift-machine-api   ocp-edge-6xs6f-master-1                                  121m
openshift-machine-api   ocp-edge-6xs6f-master-2                                  121m

[kni@ocp-edge06 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager
cloud-credential                                     True        False         False      122m
cluster-autoscaler
config-operator
console
csi-snapshot-controller
dns
etcd
image-registry
ingress
insights
kube-apiserver
kube-controller-manager
kube-scheduler
kube-storage-version-migrator
machine-api
machine-approver
machine-config
marketplace
monitoring
network
node-tuning
openshift-apiserver
openshift-controller-manager
openshift-samples
operator-lifecycle-manager
operator-lifecycle-manager-catalog
operator-lifecycle-manager-packageserver
service-ca
storage
```

must-gathers:
https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1176/index.html
https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1185/index.html

What did you expect to happen?

Deploy should pass.

How to reproduce it (as minimally and precisely as possible)?

Usual deployment.
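When triaging inspection timeouts like this, the "do you expect it to be reachable from the machine?" question from the comments can be answered with a plain TCP probe of the rootfs URL from the failing host. A minimal sketch, assuming only that the URL is the one reported in the comments; it distinguishes "no route/refused" from "reachable" but says nothing about HTTP-level errors:

```python
import socket
from urllib.parse import urlsplit

def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Attempt a TCP connect to the host:port of a URL.

    urlsplit handles bracketed IPv6 literals such as
    http://[fd00:1101::2]:80/..., and socket.create_connection
    resolves both IPv4 and IPv6 addresses. A failed connect
    (no address, no route, or refused) returns False.
    """
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example probe of the rootfs URL from the comments (run from the node):
# can_reach("http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs")
```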