2034527 – IPI deployment fails 'timeout reached while inspecting the node' when provisioning network ipv6

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Derek Higgins
QA Contact:	Lubov
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	2035219 2037419
Depends On:
Blocks:

Reported:	2021-12-21 08:36 UTC by Lubov
Modified:	2022-03-10 16:36 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:	job=periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-ovn-dualstack=all
Last Closed:	2022-03-10 16:35:46 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-baremetal-operator pull 230	None	Merged	Bug 2034527: Base IPA kernel params on provisioning network IP version	2022-01-10 15:38:54 UTC
Github	openshift cluster-baremetal-operator pull 233	None	Merged	Bug 2034527: Pass IP options to installed CoreOS image	2022-01-12 15:40:56 UTC
Github	openshift image-customization-controller pull 31	None	Merged	Bug 2034527: Add IP_OPTIONS environment variable	2022-01-12 15:40:58 UTC
Github	openshift installer pull 5521	None	Merged	Bug 2034527: Pass different IP options to installed CoreOS image and IPA	2022-01-14 10:52:11 UTC
Github	openshift ironic-agent-image pull 31	None	Merged	Bug 2034527: Allow setting IP options kernel args on install	2022-01-14 10:52:14 UTC
Red Hat Product Errata	RHSA-2022:0056	None	None	None	2022-03-10 16:36:04 UTC

Description Lubov 2021-12-21 08:36:35 UTC

Version:

$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.10.0-0.nightly-2021-12-20-231053
built from commit 9f37ece3620d14e48507f2afc5cf6a667ca2cef0
release image registry.ci.openshift.org/ocp/release@sha256:26f811bd37c593564093aa8e323cf81637c49a9527dbe6772c885fa9f55ab684
release architecture amd64


Platform:
IPI

What happened?
Deployment on real bare metal failed twice in a row:
time="2021-12-21T08:18:15+02:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg="  on ../../tmp/openshift-install-masters-2606240572/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2021-12-21T08:18:15+02:00" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg="  on ../../tmp/openshift-install-masters-2606240572/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2021-12-21T08:18:15+02:00" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error msg="  on ../../tmp/openshift-install-masters-2606240572/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2021-12-21T08:18:15+02:00" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=error
time="2021-12-21T08:18:15+02:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2021-12-21T08:18:16+02:00" level=debug msg="OpenShift Installer 4.10.0-0.nightly-2021-12-20-231053"
time="2021-12-21T08:18:16+02:00" level=debug msg="Built from commit 9f37ece3620d14e48507f2afc5cf6a667ca2cef0"
time="2021-12-21T08:18:16+02:00" level=info msg="Waiting up to 20m0s (until 8:38AM) for the Kubernetes API at https://api.ocp-edge.lab.eng.tlv2.redhat.com:6443..."
time="2021-12-21T08:18:16+02:00" level=info msg="API v1.22.1+6859754 up"
time="2021-12-21T08:18:16+02:00" level=info msg="Waiting up to 30m0s (until 8:48AM) for bootstrapping to complete..."
time="2021-12-21T08:48:16+02:00" level=info msg="Use the following commands to gather logs from the cluster"
time="2021-12-21T08:48:16+02:00" level=info msg="openshift-install gather bootstrap --help"
time="2021-12-21T08:48:16+02:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2021-12-21T08:48:16+02:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."

The status of the cluster:
[kni@ocp-edge06 ~]$ oc get nodes
No resources found
[kni@ocp-edge06 ~]$ oc get bmh -A
NAMESPACE               NAME                 STATE   CONSUMER                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0           ocp-edge-6xs6f-master-0   true             120m
openshift-machine-api   openshift-master-1           ocp-edge-6xs6f-master-1   true             120m
openshift-machine-api   openshift-master-2           ocp-edge-6xs6f-master-2   true             120m
openshift-machine-api   openshift-worker-0                                     true             120m
openshift-machine-api   openshift-worker-1                                     true             120m
openshift-machine-api   openshift-worker-2                                     true             120m
[kni@ocp-edge06 ~]$ oc get machineset -A
NAMESPACE               NAME                      DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ocp-edge-6xs6f-worker-0   3         0                             121m
[kni@ocp-edge06 ~]$ oc get machines -A
NAMESPACE               NAME                      PHASE   TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-6xs6f-master-0                                  121m
openshift-machine-api   ocp-edge-6xs6f-master-1                                  121m
openshift-machine-api   ocp-edge-6xs6f-master-2                                  121m
[kni@ocp-edge06 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                                                                                    
baremetal                                                                                         
cloud-controller-manager                                                                          
cloud-credential                                     True        False         False      122m    
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver                                                          
service-ca                                                                                        
storage   

must-gathers
https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1176/index.html 
https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1185/index.html


What did you expect to happen?
The deployment should succeed.

How to reproduce it (as minimally and precisely as possible)?
Run a standard IPI deployment with an IPv6 provisioning network.

Comment 2 Dmitry Tantsur 2021-12-21 17:30:41 UTC

Could you log into the machine's virtual console to see what is happening there?

Comment 4 Dmitry Tantsur 2021-12-22 10:26:45 UTC

Okay, got it. So we're seeing a failure to connect to the rootfs location. Now I wonder if it's a DHCP issue or an issue with routing.

The address it is trying to use, do you expect it to be reachable from the machine?

Comment 5 Lubov 2021-12-22 10:41:17 UTC

@vvoronko, can you help with the DHCP question, please?

Comment 9 Lubov 2021-12-22 12:46:37 UTC

Reproduced on 4.10.0-0.nightly-2021-12-21-130047 (baremetal network IPv4, provisioning network IPv6): https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1191/index.html

Comment 10 Arda Guclu 2021-12-23 10:15:26 UTC

*** Bug 2035219 has been marked as a duplicate of this bug. ***

Comment 14 Riccardo Pittau 2022-01-05 15:59:08 UTC

*** Bug 2037419 has been marked as a duplicate of this bug. ***

Comment 15 Derek Higgins 2022-01-05 16:52:25 UTC

As far as I can see, ip=dhcp is being included in the kernel params; this is what we had in dual-stack before METAL-1.

But the provisioning network is IPv6, so RHCOS doesn't boot and can't inspect.
I think before METAL-1 this was fine because ip=dhcp wasn't being set on the IPA image. Now it is (as the IPA image is now based on RHCOS), so it attempts to boot with ip=dhcp
and then can't download the rootfs at http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs
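
To illustrate the mismatch described above, here is a minimal sketch that picks the `ip=` kernel argument from the IP version of the rootfs host. It is not the code from the linked pull requests: the function name `ip_kernel_arg` is hypothetical, and the sketch assumes that dracut's `ip=dhcp6` value is the appropriate counterpart to `ip=dhcp` on an IPv6-only provisioning network.

```python
import ipaddress
from urllib.parse import urlparse


def ip_kernel_arg(rootfs_url: str) -> str:
    """Return an ip= kernel argument matching the IP version of the rootfs host.

    Hypothetical helper for illustration; assumes the URL host is a literal
    IP address (as in the bug report), not a DNS name.
    """
    host = urlparse(rootfs_url).hostname  # brackets around IPv6 are stripped
    if host is None:
        raise ValueError(f"no host in URL: {rootfs_url!r}")
    version = ipaddress.ip_address(host).version
    # Assumption: ip=dhcp covers DHCPv4 only, ip=dhcp6 requests DHCPv6.
    return "ip=dhcp6" if version == 6 else "ip=dhcp"


if __name__ == "__main__":
    # The rootfs URL quoted in this comment sits on the IPv6 provisioning network.
    print(ip_kernel_arg("http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs"))
```

With the URL from this report the helper selects the IPv6 variant, whereas a hard-coded `ip=dhcp` leaves the host without an address on the provisioning network, so the rootfs download fails.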

Comment 17 Zane Bitter 2022-01-12 14:57:08 UTC

This is breaking dual-stack, even in CI, so it is definitely a release blocker.

Comment 20 Derek Higgins 2022-01-14 12:39:37 UTC

The cluster is now provisioning in CI, but some of the tests are failing. A new BZ has been opened:
https://bugzilla.redhat.com/show_bug.cgi?id=2040671

Comment 22 Lubov 2022-01-16 11:18:27 UTC

Verified on 4.10.0-0.nightly-2022-01-15-092722.

Comment 25 errata-xmlrpc 2022-03-10 16:35:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
