Bug 2079593 - BM provisioning of OCP 4.10 fails 'timeout reached while inspecting the node' on dual stack hub
Summary: BM provisioning of OCP 4.10 fails 'timeout reached while inspecting the node'...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Cluster Lifecycle
Version: rhacm-2.5
Hardware: x86_64
OS: Unspecified
Priority/Severity: urgent / medium
Target Milestone: ---
Target Release: rhacm-2.5
Assignee: Jian Qiu
QA Contact: Hui Chen
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-27 20:02 UTC by Thuy Nguyen
Modified: 2023-09-15 01:54 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-12 00:37:16 UTC
Target Upstream Version:
Embargoed:
bot-tracker-sync: rhacm-2.5+
kcormier: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github stolostron backlog issues 22025 0 None None None 2022-04-28 00:27:08 UTC
Red Hat Bugzilla 2084158 1 unspecified CLOSED Support provisioning bm cluster where no provisioning network provided 2022-06-09 02:12:34 UTC

Description Thuy Nguyen 2022-04-27 20:02:16 UTC
Description of the problem: BM provisioning of OCP 4.10 fails with 'timeout reached while inspecting the node' on a dual-stack hub

Release version:

Operator snapshot version:

OCP version: 4.9.28

Browser Info:

Steps to reproduce:
1. Provision a BM cluster of OCP 4.10.5 on dual stack hub

Actual results:
Deployment failed

Expected results:

Additional info:

time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="  on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="  on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="  on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg="  13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2022-04-27T19:42:28Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=error msg="error provisioning cluster" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=debug msg="Unable to find log storage actuator. Disabling gathering logs." installID=brw229zm
time="2022-04-27T19:42:28Z" level=info msg="saving installer output" installID=brw229zm
time="2022-04-27T19:42:28Z" level=debug msg="installer console log: level=info msg=Consuming Install Config from target directory\nlevel=warning msg=Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings\nlevel=warning msg=Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated\nlevel=info msg=Manifests created in: manifests and openshift\nlevel=warning msg=Found override for release image. Please be warned, this is not advised\nlevel=info msg=Consuming Common Manifests from target directory\nlevel=info msg=Consuming Worker Machines from target directory\nlevel=info msg=Consuming Openshift Manifests from target directory\nlevel=info msg=Consuming OpenShift Install (Manifests) from target directory\nlevel=info msg=Consuming Master Machines from target directory\nlevel=info msg=Ignition-Configs created in: . and auth\nlevel=info msg=Consuming Bootstrap Ignition Config from tar...
time="2022-04-27T19:42:28Z" level=error msg="failed due to install error" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=fatal msg="runtime error" error="exit status 1"


hive version-
time="2022-04-20T17:21:09.64Z" level=info msg="Version: openshift/hive v1.1.16-343-g8871bf4cd"
time="2022-04-20T17:21:09.641Z" level=info msg="hive namespace: hive"

Comment 3 daliu 2022-04-28 03:24:25 UTC
@efried 
Could you take a look, or do we need to involve the OCP installer team?

Comment 4 Eric Fried 2022-04-28 14:39:11 UTC
Yeah, this is beyond hive's purview. I agree the installer team would be good to ask next.

Comment 5 daliu 2022-04-29 00:38:09 UTC
Asked the installer team to take a look: https://coreos.slack.com/archives/CFP6ST0A3/p1651192626753469

Comment 6 dhuynh 2022-05-02 15:14:38 UTC
This is blocking all QE squads from further test execution. FYI @ecai @acanan

Comment 11 Derek Higgins 2022-05-03 09:33:01 UTC
We can see that the virtual media is inserted by Ironic before powering on the node:

2022-04-27 19:12:28.623 1 INFO ironic.drivers.modules.redfish.boot [req-3bb5e4ff-aa78-4d0a-83df-9c7bd9a10a44 bootstrap-user - - - -] Inserted boot media http://[fd00:1101:0:2::2]:80/redfish/boot-b71f8941-5110-4dfc-b3be-47b6748a99cf.iso into VirtualMediaType.CD for node b71f8941-5110-4dfc-b3be-47b6748a99cf

Can you watch the master console while it's booting, to confirm whether or not it boots
successfully?

If it doesn't boot, can you supply any errors you see on the console?

If it does boot, you should be able to ssh to the VM as either "root" or "core" (depending on the release);
if that works, you can get more info from the ironic-python-agent logs.
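For reference, a rough sketch of those checks from the hypervisor side (the VM name is the one used in this environment; the address placeholder is an assumption, and the exact location of the agent logs can differ by release):

# Watch the master's serial console while Ironic powers it on
virsh console master-1-0

# If the ramdisk boots, ssh in as "core" (or "root" on older releases) and
# capture the boot journal, which includes the ironic-python-agent output
ssh core@<booted-master-address> 'sudo journalctl -b --no-pager' > ipa-journal.log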

Comment 12 Thuy Nguyen 2022-05-03 17:31:31 UTC
I don't see anything printed to the VM console, so I assume the VM is not booting.

virsh dumpxml master-1-0 | grep console -A3 -B7
    <serial type='pty'>
      <source path='/dev/pts/10'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/10'>
      <source path='/dev/pts/10'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>


virsh console master-1-0
Connected to domain 'master-1-0'
Escape character is ^] (Ctrl + ])


Reattaching the ironic logs.

Comment 21 daliu 2022-05-06 02:53:02 UTC
@derekh 
As ACM will reach RC next week (5/12) and this issue is blocking ACM QE testing, please treat it as high priority.

@thnguyen In order to investigate the issue ASAP, please chat directly with the OCP installer team in https://coreos.slack.com/archives/CFP6ST0A3/p1651192626753469 or give them access to your environment.

Comment 22 Derek Higgins 2022-05-07 00:45:18 UTC
Looks like this is something to do with using vmedia vs. PXE. As far as I can see, when using vmedia the kernel params for IPA end up as:

BOOT_IMAGE=/images/pxeboot/vmlinuz random.trust_cpu=on coreos.liveiso=rhcos-410.84.202201251210-0 ignition.firstboot ignition.platform.id=metal ip=dhcp6

but when using PXE they end up very different; something in there means we end up getting an IPv4 address:

deploy_kernel selinux=0 troubleshoot=0 text nofb nomodeset vga=normal ipa-insecure=1 sshkey="ssh-rsa AAAA...lc= derekh.kni.lab.eng.bos.redhat.com" ip=dhcp6 coreos.live.rootfs_url=http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs ignition.firstboot ignition.platform.id=metal ipa-debug=1 ipa-inspection-collectors=default,extra-hardware,logs ipa-enable-vlan-interfaces=all ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=http://[fd00:1101::2]:5050/v1/continue ipa-api-url=http://[fd00:1101::2]:6385 ipa-global-request-id=req-f6ccb5af-3861-4b57-9cd0-36958ad4dcc1 BOOTIF=00:7f:da:85:55:8a initrd=deploy_ramdisk


Are you able to test using PXE to confirm?
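For context, a hedged sketch of how the two boot paths are typically selected in the baremetal platform section of install-config.yaml; the values below are placeholders rather than values from this environment, and only the BMC driver scheme and provisioningNetwork setting are the point:

platform:
  baremetal:
    # PXE-based inspection/deploy relies on the provisioning network
    provisioningNetwork: "Managed"
    hosts:
      - name: master-0
        role: master
        bmc:
          # ipmi:// (or redfish://) drivers let Ironic boot the host over PXE;
          # redfish-virtualmedia:// boots it from virtual media instead
          address: ipmi://<bmc-address>
          username: <user>
          password: <password>
        bootMACAddress: <mac-address>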

Comment 25 Thuy Nguyen 2022-05-09 15:49:00 UTC
The cluster is indeed deployed successfully using pxe.

Comment 26 daliu 2022-05-10 09:58:48 UTC
Hi @thnguyen 
Could you test using vmedia with the provisioning network disabled? For details, see: https://coreos.slack.com/archives/CFP6ST0A3/p1652093718396339?thread_ts=1651192626.753469&cid=CFP6ST0A3

Since we already have a workaround, maybe we can lower the priority of this issue. We may also need to document this later.
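In install-config.yaml terms, the combination being asked about would look roughly like the sketch below (values are placeholders; only provisioningNetwork and the virtual-media BMC driver matter here):

platform:
  baremetal:
    # No provisioning network: Ironic serves inspection/deploy from the machine network
    provisioningNetwork: "Disabled"
    hosts:
      - name: master-0
        role: master
        bmc:
          # a virtual-media driver, so no PXE/provisioning network is required
          address: redfish-virtualmedia://<bmc-address>/redfish/v1/Systems/1
          username: <user>
          password: <password>
        bootMACAddress: <mac-address>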

Comment 28 Thuy Nguyen 2022-05-10 20:38:16 UTC
The OpenShift install works properly using vmedia when the provisioning network is disabled.

Comment 29 daliu 2022-05-11 00:24:55 UTC
Thanks @thnguyen @ecai 
So I think we should document this issue, and I will add a doc for it.

Comment 30 daliu 2022-05-11 00:27:06 UTC
@thnguyen Please also create an issue for the console to make the provisioning network optional, as discussed in https://coreos.slack.com/archives/CFP6ST0A3/p1652194319407539?thread_ts=1651192626.753469&cid=CFP6ST0A3
FYI @kcormier

Comment 31 bot-tracker-sync 2022-05-11 04:08:48 UTC
G2Bsync of comment 1123150862 by nelsonjean, Wed, 11 May 2022 03:48:29 UTC:

Looks like the issue will be doc'd.  Is this still a Sev-1?

Comment 32 dhuynh 2022-05-11 16:29:47 UTC
@daliu did you create a separate doc issue for this?

Comment 35 daliu 2022-05-12 00:37:16 UTC
Using a separate doc issue to track the doc part; closing this issue.

Comment 37 daliu 2022-05-13 00:06:03 UTC
doc issue : https://github.com/stolostron/backlog/issues/22488

Comment 38 Christopher Dawson 2022-07-15 11:18:16 UTC
This issue is closed. Thanks!

Comment 39 Red Hat Bugzilla 2023-09-15 01:54:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

