Bug 2102258

Summary: IPA ramdisk DHCP client loses connectivity [upstream]
Product: Red Hat OpenStack
Component: diskimage-builder
Reporter: Julia Kreger <jkreger>
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 17.0 (Wallaby)
CC: apevec, jparoly, jschluet, pweeks, sbaker
Target Milestone: beta
Target Release: 17.0
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Type: Bug
Fixed In Version: diskimage-builder-3.22.1-0.20220701120834.527e75a.el9ost
Last Closed: 2022-09-21 12:23:13 UTC

Description Julia Kreger 2022-06-29 14:23:38 UTC
Description of problem:

This is a preemptive bug for an upstream issue we've recently started to see on CentOS Stream 9, which may end up impacting RHEL 9 based builds in the near future given the cycle of package updates.

An issue appears to have surfaced in NetworkManager where the DHCP client in the ramdisk, specifically dhclient launched by NetworkManager, never settles and retries every sixty seconds while the ramdisk runs. Eventually, the dnsmasq DHCP server effectively says "You already have a lease, I'm going to ignore you until your lease runs down," and ramdisk connectivity breaks.

On the positive side, we see this in CI, which is generally slower than real hardware, but my worry right now is that this could suddenly break us.

The path we've determined upstream is to set NetworkManager's DHCP client to "internal"; using that as the client default, we've been unable to reproduce the issue so far (see the configuration sketch below). Unfortunately, this requires a diskimage-builder change to take effect.
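For reference, a minimal sketch of the NetworkManager setting in question. The drop-in file name and path here are illustrative, not the exact diskimage-builder change:

    # /etc/NetworkManager/conf.d/dhcp-client.conf  (example drop-in location)
    [main]
    # Use NetworkManager's built-in DHCP client instead of dhclient
    dhcp=internal

Because the ramdisk has no persistent configuration, a file like this has to be baked into the image at build time, which is why a diskimage-builder change is needed rather than a post-boot workaround.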

Upstream we're able to reproduce this with any ramdisk that is online for longer than roughly 250 seconds, after which the network configuration disappears from within the ramdisk and we're no longer able to ping the ramdisk, even though it continues to run.

Steps to Reproduce:

The only way to reproduce this is to have an artificially long deploy, which might mean a BMaaS deployment with a larger-than-normal image. That being said, if this appears in the product, we should likely rely on a code check (for example against the built image contents, as sketched below) unless we figure out a better way to test this, given that we are basically racing physical system performance against a clock.
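As a rough sketch of such a check, assuming an extracted ramdisk at an example path and a reachable ramdisk IP:

    # Confirm the built ramdisk carries the "internal" DHCP client setting
    grep -r "dhcp=internal" ramdisk-root/etc/NetworkManager/

    # Or, with a booted ramdisk, confirm it stays reachable well past the
    # ~250 second mark observed upstream (repeat periodically for 5+ minutes)
    ping -c 1 -W 2 <ramdisk-ip>

This only verifies the configuration and connectivity over time; it does not replace an end-to-end long-deploy test.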

Actual results:

Deployments with a full ramdisk fail upstream.


Expected results:

Deployments succeed.

Additional info:

https://storyboard.openstack.org/#!/story/2010109

Comment 7 errata-xmlrpc 2022-09-21 12:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543