Bug 2237855

Summary: UEFI (edk2/ovmf) network boot with OVN fails because there is no reply to DHCPv6 RELEASE
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Harald Jensås <hjensas>
Component: ovn23.12
Assignee: Ales Musil <amusil>
Status: CLOSED DUPLICATE
QA Contact: ying xu <yinxu>
Severity: high
Priority: high
Version: RHEL 9.0
CC: amusil, ctrautma, jiji, jishi, mmichels
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2023-10-13 19:11:22 UTC
Type: Bug
Attachments:
Traffic capture from working env with dnsmasq and failing env with OVN. (Flags: none)

Description Harald Jensås 2023-09-07 10:53:54 UTC
Created attachment 1987514
Traffic capture from working env with dnsmasq and failing env with OVN.

Description of problem:
While attempting to verify neutron change [1], we discovered that network booting still fails even though the options in the DHCPv6 ADVERTISE and REQUEST/REPLY messages are correct.

Comparing the traffic capture from the openvswitch + neutron-dhcp-agent setup with the one from the OVN setup shows one significant difference:
* neutron-dhcp-agent (dnsmasq) answers the RELEASE with a REPLY that carries a DHCPv6 Status Code option (13) with the value Success, confirming the release. edk2/ovmf starts the TFTP transfer of the NBP immediately after receiving this reply.
* OVN does not send a REPLY to the client's RELEASE. In the traffic capture the client repeats the RELEASE several times, then gives up and raises an error:

>>Start PXE over IPv6..
  Station IP address is FC01:0:0:0:0:0:0:206
  Server IP address is FC00:0:0:0:0:0:0:1
  NBP filename is snponly.efi
  NBP filesize is 0 Bytes
  PXE-E53: No boot filename received.

--------------------------------------------------
FAILING - sequence on OVN
--------------------------------------------------
No. Time Source Destination Protocol Length Info
1 0.000000 fe80::f816:3eff:fe6f:a0ab :: ICMPv6 118 Router Advertisement from fa:16:3e:6f:a0:ab
2 51.931422 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 177 Solicit XID: 0x4f04ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4
3 51.931840 fe80::f816:3eff:feeb:b176 fe80::5054:ff:feb1:a5b0 DHCPv6 198 Advertise XID: 0x4f04ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
4 56.900421 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 219 Request XID: 0x5004ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
5 56.900726 fe80::f816:3eff:feeb:b176 fe80::5054:ff:feb1:a5b0 DHCPv6 198 Reply XID: 0x5004ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
6 68.861979 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
7 69.900715 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
8 72.900784 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
9 77.900774 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
10 86.900759 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
11 103.900786 fe80::5054:ff:feb1:a5b0 ff02::1:2 DHCPv6 152 Release XID: 0x5104ed CID: 000430a25dc55972534aa516ff9c9f7c7ac4 IAA: fc01::2ad
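
The retransmission pattern in the failing trace can be quantified from the capture timestamps. A minimal Python sketch, with the timestamps copied from frames 6 to 11 above (the variable names are illustrative only):

```python
# Capture timestamps (seconds) of the six RELEASE frames in the failing OVN trace.
release_times = [68.861979, 69.900715, 72.900784, 77.900774, 86.900759, 103.900786]

# Gap between each RELEASE and the next retransmission, rounded to 10 ms.
intervals = [round(later - earlier, 2)
             for earlier, later in zip(release_times, release_times[1:])]

print(intervals)
```

The gaps come out at roughly 1, 3, 5, 9 and 17 seconds, so the client backs off between attempts and spends about 35 seconds retrying without ever receiving a REPLY.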

--------------------------------------------------
WORKING - sequence on neutron-dhcp-agent (dnsmasq)
--------------------------------------------------
No. Time Source Destination Protocol Length Info
1 0.000000 fe80::f816:3eff:fe38:eef0 ff02::1 ICMPv6 142 Router Advertisement from fa:16:3e:38:ee:f0
2 0.001102 fe80::5054:ff:fed9:3d5c ff02::1:2 DHCPv6 116 Solicit XID: 0x71d892 CID: 0004c9b0caa37bce994e85633d7572708047
3 0.001245 fe80::f816:3eff:fef5:ef7a fe80::5054:ff:fed9:3d5c DHCPv6 208 Advertise XID: 0x71d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::87
4 0.002436 fe80::5054:ff:fed9:3d5c ff02::1:2 DHCPv6 162 Request XID: 0x72d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::87
5 0.002508 fe80::f816:3eff:fef5:ef7a fe80::5054:ff:fed9:3d5c DHCPv6 219 Reply XID: 0x72d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::87
6 3.130605 fe80::5054:ff:fed9:3d5c ff02::1:2 DHCPv6 223 Request XID: 0x73d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::87
7 3.130791 fe80::f816:3eff:fef5:ef7a fe80::5054:ff:fed9:3d5c DHCPv6 256 Reply XID: 0x73d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::2a0
8 3.132060 fe80::5054:ff:fed9:3d5c ff02::1:2 DHCPv6 156 Release XID: 0x74d892 CID: 0004c9b0caa37bce994e85633d7572708047 IAA: fc01::87
9 3.132126 fe80::f816:3eff:fef5:ef7a fe80::5054:ff:fed9:3d5c DHCPv6 128 Reply XID: 0x74d892 CID: 0004c9b0caa37bce994e85633d7572708047
10 5.477847 fc01::2a0 fc00::1 TFTP 116 Read Request, File: snponly.efi, Transfer type: octet, tsize=0, blksize=1228, windowsize=4
--------------------------------------------------

The conclusion is that the OVN DHCPv6 implementation needs to be fixed: it should send a REPLY when the client sends a DHCPv6 RELEASE.

The attached file contains traffic captures from both the working (dnsmasq DHCP) setup and the failing (OVN DHCP) setup.

[1] https://review.opendev.org/c/openstack/neutron/+/890683

Version-Release number of selected component (if applicable):
Testing was performed on CentOS Stream 9 with OpenStack DevStack. OVN and OVS were compiled from source:

OVS_BRANCH="v3.1.1"
OVN_BRANCH="v23.06.0"

How reproducible:
100%

Steps to Reproduce:
The issue can be reproduced in an OpenStack DevStack environment.
Access to a reproducer can be provided, or instructions can be added in a follow-up comment if required.

Actual results:
UEFI network boot fails:

>>Start PXE over IPv6..
  Station IP address is FC01:0:0:0:0:0:0:206
  Server IP address is FC00:0:0:0:0:0:0:1
  NBP filename is snponly.efi
  NBP filesize is 0 Bytes
  PXE-E53: No boot filename received.

Firmware attempts to do a DHCPv6 RELEASE and does not receive a REPLY from the DHCPv6 implementation in OVN. After repeated attempts UEFI firmware gives up and fails the network boot process.

Expected results:
The OVN DHCPv6 implementation should answer a RELEASE with a REPLY that carries a Status Code option (13) with the value Success.
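
The expected behaviour can be sketched in a few lines of Python. This is only an illustration of the message layout, not the OVN implementation: the helper names (parse_options, encode_option, reply_to_release) are hypothetical, the RELEASE built at the bottom is reduced to a transaction ID and a Client Identifier, and the server DUID is a placeholder.

```python
import struct

MSG_REPLY = 7          # DHCPv6 message type: Reply
MSG_RELEASE = 8        # DHCPv6 message type: Release
OPT_CLIENTID = 1       # Client Identifier option
OPT_SERVERID = 2       # Server Identifier option
OPT_STATUS_CODE = 13   # Status Code option
STATUS_SUCCESS = 0


def parse_options(data: bytes) -> dict:
    """Parse DHCPv6 options: 2-byte code, 2-byte length, then the value."""
    options = {}
    offset = 0
    while offset + 4 <= len(data):
        code, length = struct.unpack_from("!HH", data, offset)
        offset += 4
        options[code] = data[offset:offset + length]
        offset += length
    return options


def encode_option(code: int, value: bytes) -> bytes:
    """Encode one DHCPv6 option in network byte order."""
    return struct.pack("!HH", code, len(value)) + value


def reply_to_release(release: bytes, server_duid: bytes) -> bytes:
    """Build a REPLY that confirms a RELEASE (hypothetical helper)."""
    if release[0] != MSG_RELEASE:
        raise ValueError("not a DHCPv6 RELEASE")
    xid = release[1:4]  # echo the client's 3-byte transaction ID
    client_duid = parse_options(release[4:])[OPT_CLIENTID]
    return (
        bytes([MSG_REPLY]) + xid
        + encode_option(OPT_CLIENTID, client_duid)
        + encode_option(OPT_SERVERID, server_duid)
        + encode_option(OPT_STATUS_CODE, struct.pack("!H", STATUS_SUCCESS))
    )


# Usage: XID and client DUID taken from the failing trace; server DUID is a placeholder.
client_duid = bytes.fromhex("000430a25dc55972534aa516ff9c9f7c7ac4")
placeholder_server_duid = bytes.fromhex("00030001") + bytes(6)
release = (bytes([MSG_RELEASE]) + bytes.fromhex("5104ed")
           + encode_option(OPT_CLIENTID, client_duid))
reply = reply_to_release(release, placeholder_server_duid)
print(reply.hex())
```

The REPLY echoes the transaction ID and Client Identifier of the RELEASE, which is what lets the firmware match it to the outstanding request and stop retransmitting.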

Additional info:
Upstream bug: https://bugs.launchpad.net/ironic/+bug/2034684

Comment 2 Harald Jensås 2023-09-12 10:07:58 UTC
(In reply to Ales Musil from comment #1)
> Patch posted:
> https://patchwork.ozlabs.org/project/ovn/patch/20230908121149.341504-1-
> amusil/

Thanks Ales!

I tested this change - and I can confirm that the REPLY to RELEASE is delivered to the client.

Frame 28: 108 bytes on wire (864 bits), 108 bytes captured (864 bits)
Ethernet II, Src: fa:16:3e:92:9f:6e (fa:16:3e:92:9f:6e), Dst: RealtekU_5a:c7:72 (52:54:00:5a:c7:72)
Internet Protocol Version 6, Src: fe80::f816:3eff:fe92:9f6e, Dst: fe80::5054:ff:fe5a:c772
User Datagram Protocol, Src Port: 547, Dst Port: 546
DHCPv6
    Message type: Reply (7)
    Transaction ID: 0x0f0306
    Client Identifier
        Option: Client Identifier (1)
        Length: 18
        DUID: 000415d8ce964cb5694f8af6ac952a751b26
        DUID Type: Universally Unique IDentifier (UUID) (4)
        UUID: 15d8ce964cb5694f8af6ac952a751b26
    Server Identifier
        Option: Server Identifier (2)
        Length: 10
        DUID: 00030001fa163e929f6e
        DUID Type: link-layer address (3)
        Hardware type: Ethernet (1)
        Link-layer address: fa:16:3e:92:9f:6e
    Status code
        Option: Status code (13)
        Length: 2
        Status Code: Success (0)
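
As a cross-check of the decode above, the option lengths account for the full 108-byte frame. A small Python sketch of the arithmetic, assuming an untagged Ethernet frame and no IPv6 extension headers (the constant names are illustrative):

```python
ETHERNET_HEADER = 14   # assumes no VLAN tag
IPV6_HEADER = 40       # assumes no extension headers
UDP_HEADER = 8
DHCPV6_HEADER = 4      # 1-byte message type + 3-byte transaction ID
OPTION_HEADER = 4      # 2-byte option code + 2-byte option length

# Option value lengths as shown in the decoded REPLY.
option_lengths = {
    "Client Identifier": 18,
    "Server Identifier": 10,
    "Status code": 2,
}

dhcpv6_payload = DHCPV6_HEADER + sum(OPTION_HEADER + n for n in option_lengths.values())
frame_length = ETHERNET_HEADER + IPV6_HEADER + UDP_HEADER + dhcpv6_payload

print(dhcpv6_payload, frame_length)
```

The sum matches the 108 bytes on the wire, so the REPLY carries exactly these three options, and the Status Code option holds only the 2-byte code with no status message text.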


Unfortunately the network boot process still fails, but the RELEASE retries are gone.
I will continue digging and open new bugs if I find other issues related to OVN DHCPv6/RAs.

Comment 3 OVN Bot 2023-09-15 04:06:49 UTC
ovn23.09 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2239061

Comment 4 Mark Michelson 2023-10-13 19:11:22 UTC
I'm closing this as a duplicate of 2230961. It may seem weird to close the original issue as a duplicate of its clone. The reason is that this issue was opened against ovn23.12. That bugzilla component was created at the beginning of this year before we switched our release schedule to release every six months. There will be no ovn23.12 release. Therefore, this issue needs to be closed since otherwise it will never get resolved via an errata. Closing it as a duplicate seems like the most correct resolution.

*** This bug has been marked as a duplicate of bug 2230961 ***

Comment 5 Mark Michelson 2023-10-13 19:12:26 UTC
Sorry, I mistyped the bug number in my previous comment and the resolution. I have updated it to 2239061.

*** This bug has been marked as a duplicate of bug 2239061 ***