Bug 1984860

Summary: Baremetal IPI is permafailing - workers are failing to PXE
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Bare Metal Hardware Provisioning Assignee: Tomas Sedovic <tsedovic>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Amit Ugol <augol>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: derekh
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-07-22 11:09:52 UTC Type: Bug

Description Stephen Benjamin 2021-07-22 11:04:27 UTC
We're seeing an increased rate of workers failing to provision, with introspection timing out. This started with build https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-07-21-130417 and appears to be caused by the "provisioning network optional" PR: https://github.com/openshift/installer/pull/5015

This is causing nightly builds to be rejected; the PR needs to be reverted or fixed ASAP.


Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1417945225736753152

baremetal-operator reports that inspection times out [1]:

{"level":"info","ts":1626905438.6808221,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~ostest-worker-0","lastError":"timeout reached while inspecting the node","current":"inspect failed","target":"manageable"}
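
To find every host in this state in a baremetal-operator log, the JSON lines can be filtered on the "current" field. This is only a minimal sketch, assuming the log holds one JSON object per line with the field names seen in the entry above; the function name find_inspect_failures is hypothetical and not part of any OpenShift tooling.

```python
import json


def find_inspect_failures(lines):
    """Return (host, lastError) pairs for entries whose current state is 'inspect failed'."""
    failures = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("current") == "inspect failed":
            failures.append((entry.get("host"), entry.get("lastError")))
    return failures


# The log line quoted in this report, used as sample input.
sample = (
    '{"level":"info","ts":1626905438.6808221,"logger":"provisioner.ironic",'
    '"msg":"current provision state","host":"openshift-machine-api~ostest-worker-0",'
    '"lastError":"timeout reached while inspecting the node",'
    '"current":"inspect failed","target":"manageable"}'
)

for host, error in find_inspect_failures([sample]):
    print(host, "->", error)
```

Run against the full log from [1], this would list each worker that hit the inspection timeout.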

libvirt serial console shows a PXE timeout[2]:

>>Start PXE over IPv4.
  PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0001 "UEFI PXEv4 (MAC:0098F104545C)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(0098F104545C,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found

>>Start PXE over IPv6.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0002 "UEFI PXEv6 (MAC:0098F104545C)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(0098F104545C,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000): Not Found

>>Start HTTP Boot over IPv4.
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Server response timeout.
BdsDxe: failed to load Boot0003 "UEFI HTTPv4 (MAC:0098F104545C)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(0098F104545C,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)/Uri(): Not Found

>>Start HTTP Boot over IPv6.
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Unexpected network error.
BdsDxe: failed to load Boot0004 "UEFI HTTPv6 (MAC:0098F104545C)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(0098F104545C,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000)/Uri(): Not Found

>>Start PXE over IPv4.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0005 "UEFI PXEv4 (MAC:0098F104545E)" from PciRoot(0x0)/Pci(0x2,0x1)/Pci(0x0,0x0)/MAC(0098F104545E,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found

>>Start PXE over IPv6.
  PXE-E16: No valid offer received.
BdsDxe: failed to load Boot0006 "UEFI PXEv6 (MAC:0098F104545E)" from PciRoot(0x0)/Pci(0x2,0x1)/Pci(0x0,0x0)/MAC(0098F104545E,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000): Not Found

>>Start HTTP Boot over IPv4.....
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Server response timeout.
BdsDxe: failed to load Boot0007 "UEFI HTTPv4 (MAC:0098F104545E)" from PciRoot(0x0)/Pci(0x2,0x1)/Pci(0x0,0x0)/MAC(0098F104545E,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)/Uri(): Not Found

>>Start HTTP Boot over IPv6.
  Error: Could not retrieve NBP file size from HTTP server.

  Error: Unexpected network error.
BdsDxe: failed to load Boot0008 "UEFI HTTPv6 (MAC:0098F104545E)" from PciRoot(0x0)/Pci(0x2,0x1)/Pci(0x0,0x0)/MAC(0098F104545E,0x1)/IPv6(0000:0000:0000:0000:0000:0000:0000:0000,0x0,Static,0000:0000:0000:0000:0000:0000:0000:0000,0x40,0000:0000:0000:0000:0000:0000:0000:0000)/Uri(): Not Found
BdsDxe: No bootable option or device was found.
BdsDxe: Press any key to enter the Boot Manager Menu.
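
The console output above can be condensed into one record per boot attempt, which makes it easier to see that every NIC and protocol fails. This is a rough sketch that assumes the console text has the same line shapes as the excerpt above (a ">>Start ..." line, indented error lines, then a "BdsDxe: failed to load" line); the function name summarize_boot_attempts and the record keys are hypothetical.

```python
import re

START = re.compile(r"^>>Start (PXE|HTTP Boot) over (IPv[46])")
ERROR = re.compile(r"^\s*(PXE-E\d+: .*|Error: .*)$")
FAILED = re.compile(r'^BdsDxe: failed to load (Boot\d{4}) "[^"]*\(MAC:([0-9A-F]+)\)"')


def summarize_boot_attempts(console_text):
    """Group firmware errors by boot attempt; return a list of dicts."""
    attempts = []
    current = None
    for line in console_text.splitlines():
        m = START.match(line)
        if m:
            # A new attempt begins; collect errors until its BdsDxe line.
            current = {"method": m.group(1), "ip": m.group(2), "errors": []}
            continue
        if current is None:
            continue
        m = ERROR.match(line)
        if m:
            current["errors"].append(m.group(1).strip())
            continue
        m = FAILED.match(line)
        if m:
            current["entry"], current["mac"] = m.group(1), m.group(2)
            attempts.append(current)
            current = None
    return attempts


# First attempt from the console excerpt in this report.
sample_console = """>>Start PXE over IPv4.
  PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0001 "UEFI PXEv4 (MAC:0098F104545C)" from PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/MAC(0098F104545C,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found
"""

for attempt in summarize_boot_attempts(sample_console):
    print(attempt["entry"], attempt["mac"], attempt["method"], attempt["ip"], attempt["errors"])
```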

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1417945225736753152/artifacts/e2e-metal-ipi/gather-extra/artifacts/pods/openshift-machine-api_metal3-5679654987-4gw68_metal3-baremetal-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1417945225736753152/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/artifacts/

Comment 1 Derek Higgins 2021-07-22 11:09:52 UTC
Looks like a dup of bz#1984576

*** This bug has been marked as a duplicate of bug 1984576 ***