Bug 2102258 - IPA Ramdisk DHCP client loses connectivity [upstream]
Summary: IPA Ramdisk DHCP client loses connectivity [upstream]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: diskimage-builder
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: beta
Target Release: 17.0
Assignee: Julia Kreger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-29 14:23 UTC by Julia Kreger
Modified: 2022-09-21 12:23 UTC
5 users

Fixed In Version: diskimage-builder-3.22.1-0.20220701120834.527e75a.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:23:13 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack Storyboard 2010109 0 None None None 2022-06-29 14:23:38 UTC
OpenStack gerrit 848017 0 None NEW Use internal dhcp client for centos 9-stream and beyond 2022-06-29 14:23:38 UTC
Red Hat Issue Tracker OSP-16151 0 None None None 2022-06-29 14:32:33 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:23:29 UTC

Description Julia Kreger 2022-06-29 14:23:38 UTC
Description of problem:

This is a preemptive bug for an upstream issue we've recently started to see in CentOS 9-Stream, which may end up impacting RHEL 9 based builds in the near future given the cycle of package updates.

It appears that an issue has appeared in NetworkManager where the DHCP client in the ramdisk, specifically dhclient as launched by NetworkManager, is not settling and retries every sixty seconds while the ramdisk runs. Eventually, the dnsmasq DHCP server effectively says "you already have a lease, I'm going to ignore you until your lease runs down." Unfortunately, ramdisk connectivity eventually breaks.

On the positive side, we only see this in CI, which is generally slower than real hardware, but my worry right now is that this suddenly starts breaking us.

The path we've determined upstream is to set NetworkManager's DHCP client setting to "internal"; with that as the client default we've been unable to reproduce this issue as of yet. Unfortunately, this requires a diskimage-builder change to take effect.
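For reference, the fix boils down to a NetworkManager configuration drop-in baked into the ramdisk image. A minimal sketch of what such a drop-in would contain (the filename is illustrative, not necessarily what the diskimage-builder change writes):

```ini
# /etc/NetworkManager/conf.d/dhcp-client.conf (illustrative path)
[main]
# Use NetworkManager's built-in DHCP client instead of spawning dhclient
dhcp=internal
```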

Upstream we're able to reproduce this with any ramdisk that is online for longer than about 250 seconds, after which the network configuration disappears from within the ramdisk and we're no longer able to ping the ramdisk, which continues to run.

Steps to Reproduce:

The only way to reproduce this is to have an artificially long deploy, which might mean a BMaaS deployment with a larger-than-normal image. That being said, if this appears in the product, we should likely just rely upon a code check if we've not figured out a better way to test this, given we're basically racing physical system performance against a clock.
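As a quick code check inside a booted ramdisk, one can ask NetworkManager which DHCP client it is configured to use. A sketch, assuming NetworkManager and its `--print-config` flag are available in the image (with a fallback note when it is not installed):

```shell
# Report the DHCP client NetworkManager is configured to use.
# An empty "dhcp=" value means the distro default (dhclient on affected builds).
if command -v NetworkManager >/dev/null 2>&1; then
    dhcp_client=$(NetworkManager --print-config | awk -F= '/^dhcp=/{print $2}')
    report="dhcp client: ${dhcp_client:-dhclient (distro default)}"
else
    report="dhcp client: NetworkManager not installed"
fi
echo "$report"
```

A fixed image should report `dhcp client: internal`.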

Actual results:

Deployments with a full ramdisk fail upstream.


Expected results:

Deployments succeed.

Additional info:

https://storyboard.openstack.org/#!/story/2010109

Comment 7 errata-xmlrpc 2022-09-21 12:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

