Bug 1822805 - Node sporadically gets stuck during PXE boot at "iPXE initialising devices.." (when using IPv6 provisioning network)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Julia Kreger
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-09 21:51 UTC by Marius Cornea
Modified: 2020-05-08 14:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-08 14:39:48 UTC
Target Upstream Version:
Embargoed:


Attachments
pxe_stuck.png (36.63 KB, image/png)
2020-04-09 21:51 UTC, Marius Cornea
packet capture (48.45 KB, text/plain)
2020-04-16 21:27 UTC, Marius Cornea
console screenshot (37.08 KB, image/png)
2020-04-16 21:27 UTC, Marius Cornea

Description Marius Cornea 2020-04-09 21:51:44 UTC
Created attachment 1677660 [details]
pxe_stuck.png

Description of problem:

A node sporadically gets stuck during PXE boot at "iPXE initialising devices.." when using an IPv6 provisioning network. This can happen to master or worker nodes, but a worker node is more likely to hit it because workers PXE boot twice (once for introspection and once for provisioning).

The servers are ProLiant DL380 Gen10 machines and the PXE boot interface is a 10G port of an add-on HPE Eth 10Gb 4p 563SFP+ Adptr card.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-04-025830

How reproducible:
Not always, but fairly consistently: an estimated 1 in 4 times.
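
The point in the description that workers are more exposed because they PXE boot twice can be made concrete with a small calculation. This is only a sketch under assumptions that the report does not confirm: every PXE boot is assumed to stall independently with the same probability p, and the 1-in-4 estimate is used as a per-boot rate purely for illustration (the report does not say whether it is per boot or per deployment). The function name is hypothetical.

```python
def stall_probability(p_per_boot, boots):
    """Chance that at least one of `boots` PXE boots stalls.

    Assumes each boot stalls independently with probability `p_per_boot`
    (an illustrative assumption, not something measured in this report).
    """
    return 1.0 - (1.0 - p_per_boot) ** boots


if __name__ == "__main__":
    p = 0.25  # illustrative per-boot rate, borrowed from the 1-in-4 estimate
    print("master (1 PXE boot): ", stall_probability(p, 1))
    print("worker (2 PXE boots):", stall_probability(p, 2))
```

Under these assumptions a worker is noticeably more likely than a master to hit the stall at least once, which matches the direction of the observation above; the absolute numbers should not be read as measurements.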

Steps to Reproduce:
1. Deploy IPI bare metal setup with 3 x masters + 2 worker nodes using ProLiant DL380 Gen10 machines and HPE Eth 10Gb 4p 563SFP+ Adptr card for PXE boot

Actual results:

Nodes sporadically get stuck during PXE boot at "iPXE initialising devices..", which breaks the PXE boot process, and the node ends up booting from the drive. If the affected node is a master node, the deployment fails due to a timeout. If the affected node is a worker node, it never gets deployed and the bmh resource remains stuck on inspecting:

(openstack-cli) [kni@ocp-edge06 ~]$ oc -n openshift-machine-api get bmh | grep inspecting
openshift-worker-0   OK       inspecting               ocp-edge-worker-0-v97bn   redfish://10.46.2.223/redfish/v1/Systems/1                      true     
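
A minimal sketch of how the check above could be scripted when many hosts are involved. It parses the text output of `oc -n openshift-machine-api get bmh` and returns the hosts in a given provisioning state. The column layout (name, status, provisioning state) is an assumption taken from the single row shown above, and the function name is hypothetical; rows with an empty status column would shift the fields and be missed.

```python
def hosts_in_state(table_text, state="inspecting"):
    """Return names of hosts whose third column equals `state`.

    Assumed column order, inferred from the row in this report:
    NAME, STATUS, PROVISIONING STATE, ...
    """
    names = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == state:
            names.append(fields[0])
    return names


# The row quoted in this report, used as sample input.
SAMPLE_BMH_OUTPUT = (
    "openshift-worker-0   OK       inspecting               "
    "ocp-edge-worker-0-v97bn   "
    "redfish://10.46.2.223/redfish/v1/Systems/1                      true\n"
)

if __name__ == "__main__":
    print(hosts_in_state(SAMPLE_BMH_OUTPUT))
```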


Expected results:
No issue during PXE boot.

Additional info:

I have taken the following actions to try to work around this issue, but it is still present:

 - updated the HPE Eth 10Gb 4p 563SFP+ NIC firmware using the Non-Volatile Memory (NVM) Update Utility for Intel® Ethernet Network Adapter 700 Series; the current NIC firmware version is 1.2585.0, so it is up to date
 - configured the pre-boot network interface to the 1st port of the HPE Eth 10Gb 4p 563SFP+ Adptr
 - configured the pre-boot network mode to IPv6
 - configured a manual IPv6 address in the pre-boot network

Attaching a screenshot of the console showing this issue.

Comment 2 Derek Higgins 2020-04-16 14:54:17 UTC
Would it be possible to get a dump of the network traffic (specifically DHCPv6 on ports 546/547, and HTTP too if possible, though that is less important)?
I'd like to see what dnsmasq is communicating with the node that gets stuck.

Comment 3 Marius Cornea 2020-04-16 21:27:07 UTC
(In reply to Derek Higgins from comment #2)
> Would it be possible to get a dump of network traffic (specifically DHCPv6
> ports 546/547, also http if possible but not as important).
> I'd like to see what dnsmasq is communicating with the node that gets stuck.

Attaching the packet capture and console screenshot. The issue occurs for the node with MAC address 48:df:37:c7:f7:b0

The packet capture was taken with the following filter:
tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'

Comment 4 Marius Cornea 2020-04-16 21:27:29 UTC
Created attachment 1679493 [details]
packet capture

Comment 5 Marius Cornea 2020-04-16 21:27:51 UTC
Created attachment 1679494 [details]
console screenshot

Comment 6 Marius Cornea 2020-04-16 21:30:21 UTC
As a side note: I tried switching to using the onboard 1G NIC for provisioning and the issue reproduced as well.

Comment 7 Derek Higgins 2020-04-20 14:25:05 UTC
(In reply to Marius Cornea from comment #3)
> Packet capture was done with following filters:
> tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'
Thanks.

This looks similar to a problem we saw at one stage during development, which stopped occurring after we changed our UEFI setup.

dnsmasq was sending truncated replies to the DHCP client [1].

I believe these two DHCPv6 replies are truncated and cause the iPXE process to stall:
21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
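
As a rough aid for reading a capture like this, the sketch below scans tcpdump text output for server replies carrying the `[|dhcp6ext]` marker seen on the two lines above, and derives the client MAC from the destination link-local address, assuming that address was formed with EUI-64. The function names are hypothetical, and the parser assumes one packet per line, as pasted in this comment. tcpdump also prints this kind of marker when the capture snap length cuts a packet short, so a hit is a hint rather than proof of a truncated reply.

```python
import re

# Marker tcpdump printed on the two replies flagged in this comment.
TRUNCATION_MARKER = "[|dhcp6ext]"

# Matches "<server>.dhcpv6-server > <client>.dhcpv6-client".
FLOW_RE = re.compile(r"(\S+)\.dhcpv6-server > (\S+)\.dhcpv6-client")


def mac_from_link_local(addr):
    """Recover the MAC from a compressed fe80:: EUI-64 address, else None."""
    groups = addr.lower().split("::", 1)[-1].split(":")
    if len(groups) != 4:
        return None
    raw = bytes.fromhex("".join(g.zfill(4) for g in groups))
    if raw[3:5] != b"\xff\xfe":  # EUI-64 inserts ff:fe in the middle
        return None
    # Flip the universal/local bit back and drop the inserted ff:fe bytes.
    mac = bytes([raw[0] ^ 0x02]) + raw[1:3] + raw[5:8]
    return ":".join(f"{b:02x}" for b in mac)


def truncated_replies(lines):
    """Yield (timestamp, client_address, client_mac) for flagged replies."""
    for line in lines:
        if TRUNCATION_MARKER not in line:
            continue
        match = FLOW_RE.search(line)
        if match is None:
            continue
        client = match.group(2)
        yield line.split()[0], client, mac_from_link_local(client)


# The two lines quoted in this comment, used as sample input.
SAMPLE = [
    "21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])",
    "21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])",
]

if __name__ == "__main__":
    for timestamp, client, mac in truncated_replies(SAMPLE):
        print(timestamp, client, mac)
```

For the first sample line the derived MAC is 48:df:37:c7:f7:b0, the address reported in comment 3 for the stuck node; the second line maps to a different MAC, 48:df:37:b0:79:30.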

A fix was eventually checked into dnsmasq [2]. Would it be difficult to try a recent version of dnsmasq (in the ironic image) to confirm whether it fixes the problem?

1. http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013554.html
2. http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013649.html

Comment 8 Marius Cornea 2020-04-20 15:28:45 UTC
(In reply to Derek Higgins from comment #7)
> (In reply to Marius Cornea from comment #3)
> > Packet capture was done with following filters:
> > tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'
> thanks,
> 
> this looks similar to a problem we saw at one stage during development but
> no longer occurred
> after we changed our UEFI setup,
> 
> dnsmasq was sending truncated to the dhcp client[1]
> 
> I believe these two dhcpv6 replies are truncated and cause the ipxe process
> to stall
> 21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP
> (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server >
> fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!]
> dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
> 21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP
> (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server >
> fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!]
> dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
> 
> A fix was eventually checked into dnsmasq[2], would it be difficult to try a
> recent version of dnsmasq (in the ironic image) to confirm if it fixes the
> problem?
> 
> 1.
> http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013554.html
> 2.
> http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013649.html

I can test this in the environment, but I'm not sure how I could pull an ironic image other than the one in the release payload. Could you help me with the steps for testing this?

Comment 9 Beth White 2020-04-20 15:48:56 UTC
Moved to the new component, hardware provisioning: ironic, due to the nature of the issue.

Comment 10 Derek Higgins 2020-04-21 12:48:44 UTC
(In reply to Marius Cornea from comment #8)
> (In reply to Derek Higgins from comment #7)
> > (In reply to Marius Cornea from comment #3)
> > > Packet capture was done with following filters:
> > > tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'
> > thanks,
> > 
> > this looks similar to a problem we saw at one stage during development but
> > no longer occurred
> > after we changed our UEFI setup,
> > 
> > dnsmasq was sending truncated to the dhcp client[1]
> > 
> > I believe these two dhcpv6 replies are truncated and cause the ipxe process
> > to stall
> > 21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP
> > (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server >
> > fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!]
> > dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
> > 21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP
> > (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server >
> > fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!]
> > dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
> > 
> > A fix was eventually checked into dnsmasq[2], would it be difficult to try a
> > recent version of dnsmasq (in the ironic image) to confirm if it fixes the
> > problem?
> > 
> > 1.
> > http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013554.html
> > 2.
> > http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013649.html
> 
> I can test this in the environment but I'm not sure how I could pull an
> ironic image different than what's in the release payload. Could you help me
> with the steps for testing this?

I'm not sure how best to pull a custom ironic image into your setup, but if you can
set up an environment where this reproduces, I can take a look and see if I can figure
it out.

Comment 13 Marius Cornea 2020-05-08 14:39:48 UTC
Based on the latest runs, the issue no longer reproduces (the latest tested build is 4.4.0-0.nightly-2020-05-01-231319).

