Bug 1853267 - [IPI Baremetal] OCP 4.6 When creating a new bmh, its provisioning status is stuck on inspecting
Keywords: TestBlocker
Status: CLOSED DUPLICATE of bug 1901040
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Bob Fournier
QA Contact: Daniel
URL:
Whiteboard:
Depends On:
Blocks: 1872659
 
Reported: 2020-07-02 10:04 UTC by Daniel
Modified: 2020-12-15 20:05 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 20:05:11 UTC
Target Upstream Version:
Embargoed:



Description Daniel 2020-07-02 10:04:32 UTC
Description of problem:
When trying to create a new bmh object, the newly created bmh provisioning status is stuck on inspecting instead of becoming ready after a few seconds.

Version-Release number of selected component (if applicable):
4.6
image used: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-06-26-035408

How reproducible:
Constantly

Steps to Reproduce:
1. Make sure you have one extra worker VM provisioned
2. Create a new bmh resource ($ oc create -f new-bmh.yaml -n openshift-machine-api); a hedged sketch of such a manifest follows these steps
3. List the BareMetalHost resources in the openshift-machine-api namespace and check the newly created bmh's provisioning status ($ oc get bmh -n openshift-machine-api)
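
For reference, a minimal sketch of what a new-bmh.yaml might look like (every name, the BMC address, and the credentials below are placeholders, not values from this report; the actual manifest is linked in comment 11):

# Hypothetical minimal BareMetalHost plus its BMC credentials secret;
# all names and addresses below are placeholders.
cat <<EOF | oc create -n openshift-machine-api -f -
---
apiVersion: v1
kind: Secret
metadata:
  name: worker-2-bmc-secret
type: Opaque
data:
  username: YWRtaW4=      # base64("admin")
  password: cGFzc3dvcmQ=  # base64("password")
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-2
spec:
  online: true
  bootMACAddress: 52:54:00:31:ca:77  # example MAC (the one from comment 11)
  bmc:
    address: redfish://192.0.2.10/redfish/v1/Systems/1  # placeholder BMC endpoint
    credentialsName: worker-2-bmc-secret
EOF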

Actual results:
The newly created bmh is stuck on inspecting.

Expected results:
The newly created bmh should become ready after a few seconds.

Comment 1 Steven Hardy 2020-07-27 14:12:47 UTC
Please provide more information: did the new host boot the inspection image? Is there any error on the primary console for that node?

The most likely reason for this behavior is an environment issue, e.g. a networking issue that prevents the new host from PXE booting the inspection ramdisk.

Comment 3 Daniel 2020-08-09 06:35:24 UTC
(In reply to Steven Hardy from comment #1)
> Please provide more information, did the new host boot the inspection image?
> Is there any error on the primary console for that node?
> 
> The most likely reason for this behavior is an environment issue, e.g
> networking issue so that the new host doesn't pxe boot the inspection
> ramdisk.

As soon as I can deploy a 4.6 cluster, I'll provide the above information.

Comment 4 Daniel 2020-08-17 11:00:23 UTC
(In reply to Steven Hardy from comment #1)
> Please provide more information, did the new host boot the inspection image?
> Is there any error on the primary console for that node?
> 
> The most likely reason for this behavior is an environment issue, e.g
> networking issue so that the new host doesn't pxe boot the inspection
> ramdisk.

No, the new host did not boot the inspection image...

Comment 5 Steven Hardy 2020-08-17 15:29:28 UTC
> No, The new host did not boot the inspection image...

You need to provide more details for this to be an actionable bug report - there are many reasons why the ramdisk could fail to boot, and most are related to the test environment (wrong boot order, network connectivity, etc.).

I'm closing this for now, please reopen if/when you have more complete information about the failure.

Comment 8 Daniel 2020-09-02 10:05:19 UTC
Seems like the host fails to get a DHCP lease from the dnsmasq server.

Comment 10 Steven Hardy 2020-09-14 11:52:13 UTC
As discussed, we still need more info to make progress on this; it would be helpful to see:

- The primary console output for the BMH which failed to PXE boot into the inspection ramdisk
- The content of the new-bmh.yaml (so we can see the bootMACAddress and correlate it to the dnsmasq logs)
- Related content from the cluster dnsmasq container, so we can see if the request was processed for the failing host or not

In the must-gather we do see, e.g.:

from must-gather.local.9185609889850216903/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4a5a8761f6c3222e28963cef4560441fb637b2c3b554ba9050caf232e2e9df7a/namespaces/openshift-machine-api/pods/metal3-5645b5bf67-4nbzg/metal3-dnsmasq/metal3-dnsmasq/logs/current.log

2020-09-01T06:10:50.275662809Z dnsmasq-dhcp: 2466671 DHCPSOLICIT(enp4s0) 00:04:72:89:c3:f8:0c:d6:31:4b:99:d6:d8:73:77:91:80:f8 
2020-09-01T06:10:50.275662809Z dnsmasq-dhcp: 2466671 DHCPADVERTISE(enp4s0) fd00:1101::83af:576c:2534:27b 00:04:72:89:c3:f8:0c:d6:31:4b:99:d6:d8:73:77:91:80:f8 
2020-09-01T06:10:50.275662809Z dnsmasq-dhcp: 2466671 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server, 
2020-09-01T06:10:50.275662809Z dnsmasq-dhcp: 2466671 requested options: 16:vendor-class
2020-09-01T06:10:50.275682023Z dnsmasq-dhcp: 2466671 tags: pxe6, dhcpv6, enp4s0
2020-09-01T06:10:50.275682023Z dnsmasq-dhcp: 2466671 sent size: 18 option:  1 client-id  00:04:72:89:c3:f8:0c:d6:31:4b:99:d6:d8:73...
2020-09-01T06:10:50.275690487Z dnsmasq-dhcp: 2466671 sent size: 14 option:  2 server-id  00:01:00:01:26:df:9b:47:52:54:00:42:39:6b
2020-09-01T06:10:50.275698173Z dnsmasq-dhcp: 2466671 sent size: 40 option:  3 ia-na  IAID=2671116243 T1=1800 T2=3150
2020-09-01T06:10:50.275705583Z dnsmasq-dhcp: 2466671 nest size: 24 option:  5 iaaddr  fd00:1101::83af:576c:2534:27b PL=3600 VL=3600
2020-09-01T06:10:50.275713007Z dnsmasq-dhcp: 2466671 sent size:  9 option: 13 status  0 success
2020-09-01T06:10:50.275720316Z dnsmasq-dhcp: 2466671 sent size:  1 option:  7 preference  0
2020-09-01T06:10:50.275729652Z dnsmasq-dhcp: 2466671 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi

But without the BMH MAC and/or the exact times of the failed BMH inspection, we can't be sure whether this is the failing node or not. Following the metal3-dnsmasq container logs while testing would probably be a good idea, to watch for the related server activity immediately after adding the BMH and seeing it power on; a sketch of that workflow follows.
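
A hedged sketch of that workflow (the pod name is taken from the must-gather above, used here only as an example):

# In one terminal, follow the dnsmasq log while the new BMH is added
# and powers on.
oc logs -f -n openshift-machine-api metal3-5645b5bf67-4nbzg -c metal3-dnsmasq

# In a second terminal, create the host and watch its provisioning state.
oc create -f new-bmh.yaml -n openshift-machine-api
oc get bmh -n openshift-machine-api -w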

Comment 11 Daniel 2020-09-15 11:46:42 UTC
Response to Steven's comment:

- Primary console output: PXE-E18: Server response timeout.
BdsDxe: failed to load Boot0001 "UEFI PXEv4 (MAC:52540031CA77)" from PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/MAC(52540031CA77,0x1)/IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0): Not Found

- Content of new-bmh.yaml: https://gist.github.com/dmaizel/1e1c8d7b20d52eaa2fab56274f543187

- After running oc logs metal3-c445d4cc7-4spvq -c metal3-dnsmasq | grep 52:54:00:31:ca:77, I can see that the MAC address of the new bmh doesn't appear in the logs...

Comment 12 Steven Hardy 2020-09-18 10:40:33 UTC
> After running oc logs metal3-c445d4cc7-4spvq -c metal3-dnsmasq | grep 52:54:00:31:ca:77 I can see that the mac address of the new bmh doesn't appear in the logs...

This probably means it's an environmental issue - if the new host can't reach the DHCP server the PXE boot process will not work.

Could there be some networking issue? E.g. does a firewall need to be configured so that the new BMH can reach dnsmasq running on the cluster? A rough check is sketched below.
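
One rough way to check (a sketch only; it assumes the check is run on the master carrying the metal3 pod, and says nothing about whether a firewall is actually the cause):

# Look for rules that might drop DHCPv6 (udp/547) or TFTP (udp/69) traffic
# arriving on the provisioning network.
iptables -L -n -v | grep -E 'udp.*(547|69)'
nft list ruleset | grep -E '(547|69)'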

Comment 13 Daniel 2020-09-21 12:17:30 UTC
(In reply to Steven Hardy from comment #12)
> > After running oc logs metal3-c445d4cc7-4spvq -c metal3-dnsmasq | grep 52:54:00:31:ca:77 I can see that the mac address of the new bmh doesn't appear in the logs...
> 
> This probably means it's an environmental issue - if the new host can't
> reach the DHCP server the PXE boot process will not work.
> 
> Could there be some networking issue - e.g does a firewall need to be
> configured to enable access from the new BMH to access dnsmasq running on
> the cluster?

I'm pretty sure it isn't a firewall issue, since the bmh does sometimes get created successfully.

Comment 15 Lubov 2020-09-23 11:59:03 UTC
We reproduced the same issue on a real BM environment.

Comment 16 Raviv Bar-Tal 2020-09-24 08:47:25 UTC
Adding the testBlocker keyword

This bug blocks the scaleup test from running.
It also blocks some bug verifications, like this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1863033

Comment 17 Marius Cornea 2020-09-28 09:31:15 UTC
I've hit the same issue on a bare metal environment, and the issue is caused by the provisioning network NIC on the master node where the metal3 pod is running not having an IPv6 link-local address:

[root@openshift-master-2 core]# ip a s dev eno5
2: eno5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 48:df:37:c7:76:10 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::3/64 scope global dynamic 
       valid_lft 6sec preferred_lft 6sec

With this we can see in the dnsmasq log that the server sends DHCPADVERTISE but the client doesn't send any DHCPREQUEST:

dnsmasq-dhcp: 5933836 DHCPSOLICIT(eno5) 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 DHCPADVERTISE(eno5) fd00:1101::d9c:3529:e9c5:24d 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server, 
dnsmasq-dhcp: 5933836 requested options: 16:vendor-class
dnsmasq-dhcp: 5933836 tags: pxe6, dhcpv6, eno5
dnsmasq-dhcp: 5933836 sent size: 18 option:  1 client-id  00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff...
dnsmasq-dhcp: 5933836 sent size: 14 option:  2 server-id  00:01:00:01:27:04:5a:6e:48:df:37:c7:76:10
dnsmasq-dhcp: 5933836 sent size: 40 option:  3 ia-na  IAID=2079588287 T1=1800 T2=3150
dnsmasq-dhcp: 5933836 nest size: 24 option:  5 iaaddr  fd00:1101::d9c:3529:e9c5:24d PL=3600 VL=3600
dnsmasq-dhcp: 5933836 sent size:  9 option: 13 status  0 success
dnsmasq-dhcp: 5933836 sent size:  1 option:  7 preference  0
dnsmasq-dhcp: 5933836 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi
dnsmasq-dhcp: 5933836 available DHCP range: fd00:1101::a -- fd00:1101::ffff:ffff:ffff:fffe
dnsmasq-dhcp: 5933836 vendor class: 343
dnsmasq-dhcp: 5933836 DHCPSOLICIT(eno5) 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 DHCPADVERTISE(eno5) fd00:1101::d9c:3529:e9c5:24d 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server, 
dnsmasq-dhcp: 5933836 requested options: 16:vendor-class
dnsmasq-dhcp: 5933836 tags: pxe6, dhcpv6, eno5
dnsmasq-dhcp: 5933836 sent size: 18 option:  1 client-id  00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff...
dnsmasq-dhcp: 5933836 sent size: 14 option:  2 server-id  00:01:00:01:27:04:5a:6e:48:df:37:c7:76:10
dnsmasq-dhcp: 5933836 sent size: 40 option:  3 ia-na  IAID=2079588287 T1=1800 T2=3150
dnsmasq-dhcp: 5933836 nest size: 24 option:  5 iaaddr  fd00:1101::d9c:3529:e9c5:24d PL=3600 VL=3600
dnsmasq-dhcp: 5933836 sent size:  9 option: 13 status  0 success
dnsmasq-dhcp: 5933836 sent size:  1 option:  7 preference  0
dnsmasq-dhcp: 5933836 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi
dnsmasq-dhcp: RTR-SOLICIT(eno5) 
dnsmasq-dhcp: RTR-ADVERT(eno5) fd00:1101::
dnsmasq-dhcp: RTR-SOLICIT(eno5) 
dnsmasq-dhcp: RTR-ADVERT(eno5) fd00:1101::
dnsmasq-dhcp: 5933836 available DHCP range: fd00:1101::a -- fd00:1101::ffff:ffff:ffff:fffe
dnsmasq-dhcp: 5933836 vendor class: 343
dnsmasq-dhcp: 5933836 DHCPSOLICIT(eno5) 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 DHCPADVERTISE(eno5) fd00:1101::d9c:3529:e9c5:24d 00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff 
dnsmasq-dhcp: 5933836 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server, 
dnsmasq-dhcp: 5933836 requested options: 16:vendor-class
dnsmasq-dhcp: 5933836 tags: pxe6, dhcpv6, eno5
dnsmasq-dhcp: 5933836 sent size: 18 option:  1 client-id  00:04:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff...
dnsmasq-dhcp: 5933836 sent size: 14 option:  2 server-id  00:01:00:01:27:04:5a:6e:48:df:37:c7:76:10
dnsmasq-dhcp: 5933836 sent size: 40 option:  3 ia-na  IAID=2079588287 T1=1800 T2=3150
dnsmasq-dhcp: 5933836 nest size: 24 option:  5 iaaddr  fd00:1101::d9c:3529:e9c5:24d PL=3600 VL=3600
dnsmasq-dhcp: 5933836 sent size:  9 option: 13 status  0 success
dnsmasq-dhcp: 5933836 sent size:  1 option:  7 preference  0
dnsmasq-dhcp: 5933836 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi
dnsmasq-dhcp: RTR-SOLICIT(eno5) 

Nevertheless, if we capture the traffic on the eno5 interface with tcpdump, we can see that the DHCPADVERTISE packets never leave the node; the capture below sketches the idea.
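
A sketch of that capture (DHCPv6 uses udp/546 on the client side and udp/547 on the server side):

# Watch DHCPv6 traffic on the provisioning NIC; with the link-local address
# gone, the DHCPADVERTISE replies logged by dnsmasq never show up on the wire.
tcpdump -i eno5 -n 'ip6 and (udp port 546 or udp port 547)'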

After restoring the link-local address on the provisioning NIC on the master node (toggling addr_gen_mode forces the kernel to regenerate the fe80:: EUI-64 address), the nodes get DHCP addresses and can PXE boot:

[root@openshift-master-2 core]# sysctl net.ipv6.conf.eno5.addr_gen_mode=0
net.ipv6.conf.eno5.addr_gen_mode = 0

[root@openshift-master-2 core]# sysctl net.ipv6.conf.eno5.addr_gen_mode=1
net.ipv6.conf.eno5.addr_gen_mode = 1

[root@openshift-master-2 core]# ip a s dev eno5
2: eno5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 48:df:37:c7:76:10 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::3/64 scope global dynamic 
       valid_lft 9sec preferred_lft 9sec
    inet6 fe80::4adf:37ff:fec7:7610/64 scope link 
       valid_lft forever preferred_lft forever

Comment 18 Marius Cornea 2020-09-28 12:46:05 UTC
It looks like the link-local address gets removed when NetworkManager activates the connection:

Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0467] device (eno5): Activation: starting connection 'eno5' (5e8a265c-0625-4c15-ba9b-95c6e5a3fb5e)
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0468] device (eno5): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0471] device (eno5): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0473] device (eno5): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0476] device (eno5): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0486] device (eno5): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0488] device (eno5): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Sep 28 12:29:44 openshift-master-1 NetworkManager[2477]: <info>  [1601296184.0498] device (eno5): Activation: successful, device activated.


Mon Sep 28 12:29:43 UTC 2020
2: eno5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 48:df:37:c7:76:40 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::3/64 scope global dynamic 
       valid_lft 6sec preferred_lft 6sec
    inet6 fe80::4adf:37ff:fec7:7640/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
Mon Sep 28 12:29:44 UTC 2020
2: eno5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 48:df:37:c7:76:40 brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::3/64 scope global tentative dynamic 
       valid_lft 10sec preferred_lft 10sec
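
Until the root cause is fixed, one hypothetical local mitigation would be a NetworkManager dispatcher script that re-runs the addr_gen_mode toggle from comment 17 whenever the provisioning interface comes up. This is only a sketch under that assumption, not the workaround patch referenced later in this bug (see comment 22):

#!/bin/bash
# /etc/NetworkManager/dispatcher.d/99-restore-linklocal (hypothetical sketch)
# NetworkManager invokes dispatcher scripts with the interface name as $1
# and the event as $2.
IFACE="$1"
ACTION="$2"
if [ "$IFACE" = "eno5" ] && [ "$ACTION" = "up" ]; then
    # As in comment 17: addr_gen_mode=0 forces EUI-64 regeneration of the
    # fe80:: link-local address; setting it back to 1 restores the prior mode.
    sysctl -w "net.ipv6.conf.${IFACE}.addr_gen_mode=0"
    sysctl -w "net.ipv6.conf.${IFACE}.addr_gen_mode=1"
fi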

Comment 19 Daniel 2020-09-30 06:34:26 UTC
I can confirm Marius's comment: creating a bmh works perfectly fine when using IPv4 for the provisioning network.

Comment 21 Lubov 2020-11-29 13:32:24 UTC
Is there any progress with this bz? Inspection is still stuck when the provisioning network is IPv6.

Comment 22 Bob Fournier 2020-12-15 19:44:46 UTC
This issue, where the removal of the link-local address causes dnsmasq and PXE failures, is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1901040. Once the link-local address is removed, dnsmasq can no longer respond to multicast packets, as Marius found in comment 17. We are able to reproduce this issue, and Derek has a workaround patch.

I just wanted to confirm that this is occurring with 4.7. If so, we can close this one as a duplicate of 1901040; otherwise we'll use it to track the fix for 4.6.

Comment 23 Marius Cornea 2020-12-15 19:47:31 UTC
(In reply to Bob Fournier from comment #22)
> This issue with the link-local address removal causing dnsmasq and PXE
> issues is being tracked in
> https://bugzilla.redhat.com/show_bug.cgi?id=1901040.  Once the link-local
> address is removed dnsmasq can no longer respond to multicast packets as
> Marius found in Comment 17.  We are able to reproduce this issue and Derek
> has a workaround patch.
> 
> I just wanted to confirm that this is occurring with 4.7?  If so we can
> close this one as a duplicate to 1901040, otherwise we'll use it to track
> the fix for 4.6.

I can confirm it is also occurring with 4.7.

Comment 24 Bob Fournier 2020-12-15 20:05:11 UTC
Thanks Marius. I'm going to close this as a duplicate, as the patch and debug info are in 1901040.

*** This bug has been marked as a duplicate of bug 1901040 ***

