Created attachment 1677660 [details]
pxe_stuck.png

Description of problem:

A node sporadically gets stuck during PXE boot at "iPXE initialising devices.." when using an IPv6 provisioning network. This can happen to master or worker nodes, but it is more likely to show up on a worker node since workers PXE boot twice (for introspection and for provisioning).

The servers are ProLiant DL380 Gen10 machines and the PXE boot interface is a 10G port of an add-on HPE Eth 10Gb 4p 563SFP+ Adptr card.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-04-025830

How reproducible:
Not always, but fairly consistently; an estimated 1 in 4 times.

Steps to Reproduce:
1. Deploy an IPI bare metal setup with 3 masters and 2 worker nodes, using ProLiant DL380 Gen10 machines and an HPE Eth 10Gb 4p 563SFP+ Adptr card for PXE boot.

Actual results:
Nodes sporadically get stuck during PXE boot at "iPXE initialising devices..", which breaks the PXE boot process, and the node ends up booting from the drive. If the affected node is a master, the deployment fails due to a timeout. If the affected node is a worker, it never gets deployed and the bmh resource remains stuck on inspecting:

(openstack-cli) [kni@ocp-edge06 ~]$ oc -n openshift-machine-api get bmh | grep inspecting
openshift-worker-0 OK inspecting ocp-edge-worker-0-v97bn redfish://10.46.2.223/redfish/v1/Systems/1 true

Expected results:
No issue during PXE boot.

Additional info:
These are the actions I've taken to try to work around this issue; the issue is still present after all of them:
- updated the HPE Eth 10Gb 4p 563SFP+ NIC firmware using the Non-Volatile Memory (NVM) Update Utility for Intel® Ethernet Network Adapter 700 Series; the current NIC firmware version is 1.2585.0, so it's up to date
- configured the pre-boot network interface to the 1st port of the HPE Eth 10Gb 4p 563SFP+ Adptr
- configured the pre-boot network mode to IPv6
- configured a manual IPv6 address in the pre-boot network

Attaching a screenshot of the console showing this issue.
Would it be possible to get a dump of the network traffic (specifically DHCPv6, ports 546/547; HTTP as well if possible, though that is less important)?
I'd like to see what dnsmasq is communicating to the node that gets stuck.
(In reply to Derek Higgins from comment #2)
> Would it be possible to get a dump of network traffic (specifically DHCPv6
> ports 546/547, also http if possible but not as important).
> I'd like to see what dnsmasq is communicating with the node that gets stuck.

Attaching the packet capture and console screenshot. The issue occurs for the node with MAC address 48:df:37:c7:f7:b0.

The packet capture was done with the following filter:
tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'
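To pick the stuck node's traffic out of the capture, it helps to know which link-local address it is likely to use. The following is a minimal sketch, assuming the pre-boot firmware forms its link-local address from the MAC with the modified EUI-64 scheme (an assumption; some stacks use other interface identifiers). The function name `mac_to_link_local` is hypothetical.

```python
import ipaddress


def mac_to_link_local(mac: str) -> str:
    """Derive the fe80::/64 address for a MAC, assuming modified EUI-64."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets in MAC address, got {len(octets)}")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC to get a 64-bit identifier.
    interface_id = bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])
    # fe80::/64 prefix: fe 80 followed by six zero bytes.
    prefix = bytes([0xFE, 0x80]) + bytes(6)
    return str(ipaddress.IPv6Address(prefix + interface_id))


if __name__ == "__main__":
    # MAC of the node that got stuck in this report.
    print(mac_to_link_local("48:df:37:c7:f7:b0"))  # fe80::4adf:37ff:fec7:f7b0
```

Under that assumption, DHCPv6 traffic to or from fe80::4adf:37ff:fec7:f7b0 in the capture belongs to the stuck node.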
Created attachment 1679493 [details] packet capture
Created attachment 1679494 [details] console screenshot
As a side note: I tried switching to using the onboard 1G NIC for provisioning and the issue reproduced as well.
(In reply to Marius Cornea from comment #3)
> Packet capture was done with following filters:
> tcpdump -i ens1f0 -n -vv '(udp port 546 or 547) or icmp6'

Thanks.

This looks similar to a problem we saw at one stage during development, which no longer occurred after we changed our UEFI setup: dnsmasq was sending truncated replies to the DHCP client [1].

I believe these two DHCPv6 replies are truncated and cause the iPXE process to stall:

21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])

A fix was eventually checked into dnsmasq [2]. Would it be difficult to try a recent version of dnsmasq (in the ironic image) to confirm whether it fixes the problem?

1. http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013554.html
2. http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q4/013649.html
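For anyone checking a longer capture for the same symptom, here is a rough sketch that scans the text output of `tcpdump -vv` for DHCPv6 server-to-client packets that tcpdump could not decode to the end. It assumes the `[|` marker (as in `[|dhcp6ext]`) is how tcpdump flags a packet it ran out of data on, and that each record starts with an `HH:MM:SS.ffffff IP6` timestamp as in the lines quoted in this comment. The function name `truncated_replies` is hypothetical.

```python
import re

# Timestamp format as it appears at the start of each record.
TS = r"\d{2}:\d{2}:\d{2}\.\d+"

SAMPLE = """\
21:17:07.537946 IP6 (class 0xc0, flowlabel 0x558c6, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:fec7:f7b0.dhcpv6-client: [bad udp cksum 0x411a -> 0x11fe!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
21:17:15.719928 IP6 (class 0xc0, flowlabel 0x9510f, hlim 64, next-header UDP (17) payload length: 72) fe80::9792:82b3:6153:4ece.dhcpv6-server > fe80::4adf:37ff:feb0:7930.dhcpv6-client: [bad udp cksum 0xc282 -> 0x9095!] dhcp6 msgtype-134 (xid=0 (opt_16576) (opt_0) (opt_0)[|dhcp6ext])
"""


def truncated_replies(text):
    """Return (timestamp, client address) for each server reply marked with '[|'."""
    hits = []
    # Split the output into records, each beginning at a timestamp.
    for record in re.split(rf"(?={TS} IP6 )", text):
        stamp = re.match(rf"({TS}) IP6 ", record)
        client = re.search(
            r"\.dhcpv6-server > ([0-9a-f:]+)\.dhcpv6-client:", record
        )
        # Keep only server-to-client records that carry the truncation marker.
        if stamp and client and "[|" in record:
            hits.append((stamp.group(1), client.group(1)))
    return hits


if __name__ == "__main__":
    for stamp, client in truncated_replies(SAMPLE):
        print(stamp, client)
```

Run against the two records quoted in this comment, it reports both of them, addressed to fe80::4adf:37ff:fec7:f7b0 and fe80::4adf:37ff:feb0:7930.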
(In reply to Derek Higgins from comment #7)
> A fix was eventually checked into dnsmasq[2], would it be difficult to try a
> recent version of dnsmasq (in the ironic image) to confirm if it fixes the
> problem?

I can test this in the environment, but I'm not sure how I could pull an ironic image different from the one in the release payload. Could you help me with the steps for testing this?
Moved against new component hardware provisioning: ironic due to nature of the issue
(In reply to Marius Cornea from comment #8)
> I can test this in the environment but I'm not sure how I could pull an
> ironic image different than what's in the release payload. Could you help me
> with the steps for testing this?

I'm not sure how best to pull a custom ironic image into your setup, but if you can set up an environment where this reproduces, I can take a look and see if I can figure it out.
Based on the latest runs, the issue no longer reproduces (the latest tested build is 4.4.0-0.nightly-2020-05-01-231319).