Description of the problem: BM provisioning of OCP 4.10 fails with 'timeout reached while inspecting the node' on a dual-stack hub.

Release version:
Operator snapshot version:
OCP version: 4.9.28
Browser Info:

Steps to reproduce:
1. Provision a BM cluster of OCP 4.10.5 on a dual-stack hub

Actual results: Deployment failed

Expected results:

Additional info:

time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg=" on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg=" 13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg=" on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg=" 13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'timeout reached while inspecting the node'"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error msg=" on ../tmp/openshift-install-masters-1383611823/main.tf line 13, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2022-04-27T19:42:27Z" level=error msg=" 13: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=error
time="2022-04-27T19:42:27Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2022-04-27T19:42:28Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=error msg="error provisioning cluster" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=debug msg="Unable to find log storage actuator. Disabling gathering logs." installID=brw229zm
time="2022-04-27T19:42:28Z" level=info msg="saving installer output" installID=brw229zm
time="2022-04-27T19:42:28Z" level=debug msg="installer console log: level=info msg=Consuming Install Config from target directory\nlevel=warning msg=Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings\nlevel=warning msg=Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated\nlevel=info msg=Manifests created in: manifests and openshift\nlevel=warning msg=Found override for release image. Please be warned, this is not advised\nlevel=info msg=Consuming Common Manifests from target directory\nlevel=info msg=Consuming Worker Machines from target directory\nlevel=info msg=Consuming Openshift Manifests from target directory\nlevel=info msg=Consuming OpenShift Install (Manifests) from target directory\nlevel=info msg=Consuming Master Machines from target directory\nlevel=info msg=Ignition-Configs created in: . and auth\nlevel=info msg=Consuming Bootstrap Ignition Config from tar...
time="2022-04-27T19:42:28Z" level=error msg="failed due to install error" error="exit status 1" installID=brw229zm
time="2022-04-27T19:42:28Z" level=fatal msg="runtime error" error="exit status 1"

Hive version:
time="2022-04-20T17:21:09.64Z" level=info msg="Version: openshift/hive v1.1.16-343-g8871bf4cd"
time="2022-04-20T17:21:09.641Z" level=info msg="hive namespace: hive"
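In case it helps anyone reproducing this, a rough sketch of how the inspection failure can be dug into on the bootstrap VM while the install is still running; the bootstrap address is a placeholder and the exact container names vary by release, so treat this as an assumption rather than an exact recipe.

```sh
# During a bare-metal IPI install, Ironic and the inspector run as podman
# containers on the bootstrap VM, so the inspection errors land in their logs.
ssh core@<bootstrap-ip>

# List the running containers, then pull the logs of whichever containers
# carry ironic/inspector in their name (names differ between releases).
sudo podman ps
sudo podman logs ironic 2>&1 | grep -i inspect
```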
@efried Could you help take a look, or do we need to involve the OCP installer team?
Yeah, this is beyond hive's purview. I agree the installer team would be good to ask next.
Asked the installer team to help take a look: https://coreos.slack.com/archives/CFP6ST0A3/p1651192626753469
This is blocking all QE squads from further test execution. FYI @ecai @acanan
We can see that the vmedia is inserted by Ironic before powering on the node:

2022-04-27 19:12:28.623 1 INFO ironic.drivers.modules.redfish.boot [req-3bb5e4ff-aa78-4d0a-83df-9c7bd9a10a44 bootstrap-user - - - -] Inserted boot media http://[fd00:1101:0:2::2]:80/redfish/boot-b71f8941-5110-4dfc-b3be-47b6748a99cf.iso into VirtualMediaType.CD for node b71f8941-5110-4dfc-b3be-47b6748a99cf

Can you watch the master console while it's booting to confirm whether it boots successfully? If it doesn't boot, please share any errors you see on the console. If it does boot, you should be able to ssh to the VM as either "root" or "core" (depending on the release); if that works you can get more info from the ironic-python-agent logs.
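For reference, a minimal sketch of the checks being asked for above; the domain name, node address, and the exact IPA service name are assumptions and may differ per environment.

```sh
# Watch the serial console of the master VM while Ironic powers it on
# (the libvirt domain name here is just an example).
virsh console master-1-0

# If the live image does boot, ssh in as "core" (or "root" on older releases).
ssh core@<node-ip>

# On the node, grep the journal for ironic-python-agent output; the exact
# systemd unit name varies by release, so a broad grep is the safest bet.
sudo journalctl -b | grep -i ironic
```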
I don't see anything printed to the VM console, so I assume the VM is not booting.

virsh dumpxml master-1-0 | grep console -A3 -B7
<serial type='pty'>
  <source path='/dev/pts/10'/>
  <target type='isa-serial' port='0'>
    <model name='isa-serial'/>
  </target>
  <alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/10'>
  <source path='/dev/pts/10'/>
  <target type='serial' port='0'/>
  <alias name='serial0'/>
</console>

virsh console master-1-0
Connected to domain 'master-1-0'
Escape character is ^] (Ctrl + ])

Reattaching the ironic logs.
@derekh ACM will RC next week (5/12) and this issue is blocking ACM QE testing, so please treat it as high priority. @thnguyen To get the issue investigated ASAP, please chat directly with the OCP installer team in https://coreos.slack.com/archives/CFP6ST0A3/p1651192626753469 or give them access to your environment.
Looks like this has something to do with using vmedia vs PXE. As far as I can see, when using vmedia the kernel params for IPA end up as:

BOOT_IMAGE=/images/pxeboot/vmlinuz random.trust_cpu=on coreos.liveiso=rhcos-410.84.202201251210-0 ignition.firstboot ignition.platform.id=metal ip=dhcp6

but when using PXE they end up very different; something in there means we end up getting an IPv4 address:

deploy_kernel selinux=0 troubleshoot=0 text nofb nomodeset vga=normal ipa-insecure=1 sshkey="ssh-rsa AAAA...lc= derekh.kni.lab.eng.bos.redhat.com" ip=dhcp6 coreos.live.rootfs_url=http://[fd00:1101::2]:80/images/ironic-python-agent.rootfs ignition.firstboot ignition.platform.id=metal ipa-debug=1 ipa-inspection-collectors=default,extra-hardware,logs ipa-enable-vlan-interfaces=all ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 ipa-inspection-callback-url=http://[fd00:1101::2]:5050/v1/continue ipa-api-url=http://[fd00:1101::2]:6385 ipa-global-request-id=req-f6ccb5af-3861-4b57-9cd0-36958ad4dcc1 BOOTIF=00:7f:da:85:55:8a initrd=deploy_ramdisk

Are you able to test using PXE to confirm?
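If a node boots far enough to log into (with either boot method), a quick way to confirm which of the two argument sets it actually received is to read the kernel command line directly; the ssh user and address below are placeholders.

```sh
# Dump the kernel command line the live image was started with, to compare
# the vmedia and PXE cases side by side.
ssh core@<node-ip> cat /proc/cmdline
```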
The cluster is indeed deployed successfully using PXE.
Hi @thnguyen Could you help test vmedia with the provisioning network disabled? Details: https://coreos.slack.com/archives/CFP6ST0A3/p1652093718396339?thread_ts=1651192626.753469&cid=CFP6ST0A3 Since we already have a workaround, maybe we can lower the priority of this issue. We may also need to document this later.
The OpenShift install works properly using vmedia when the provisioning network is disabled.
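For anyone hitting the same thing, a minimal sketch of the platform stanza this workaround implies, written to a scratch file so it can be merged into install-config.yaml by hand. The field names follow the standard bare-metal platform schema, but they (and the redfish-virtualmedia BMC address form) should be double-checked against the installer docs for the release being deployed; all angle-bracket values are placeholders.

```sh
# Virtual-media deployment with the provisioning network disabled:
# provisioningNetwork is set to "Disabled" and the BMCs are addressed via
# redfish-virtualmedia, so no PXE/provisioning network is needed.
cat > baremetal-platform-snippet.yaml <<'EOF'
platform:
  baremetal:
    provisioningNetwork: "Disabled"
    hosts:
      - name: master-1-0
        role: master
        bmc:
          address: redfish-virtualmedia://<bmc-address>/redfish/v1/Systems/<system-id>
          username: <bmc-user>
          password: <bmc-password>
        bootMACAddress: <mac-address>
EOF
```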
Thanks @thnguyen @ecai. I think we should document this issue, and I will add a doc for it.
@thnguyen Please also help create an issue for the console to make the provisioning network optional, as discussed in https://coreos.slack.com/archives/CFP6ST0A3/p1652194319407539?thread_ts=1651192626.753469&cid=CFP6ST0A3 FYI @kcormier
G2Bsync comment 1123150862 from nelsonjean, Wed, 11 May 2022 03:48:29 UTC: Looks like the issue will be doc'd. Is this still a Sev-1?
@daliu did you create a separate doc issue for this?
Using a separate doc issue to track the documentation part, and closing this issue.
Doc issue: https://github.com/stolostron/backlog/issues/22488
This issue is closed. Thanks!
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days