Bug 2189428

Summary: CentOS Stream 8 LXD and cloud images taking 2 minutes to get IP
Product: Red Hat Enterprise Linux 8
Reporter: Lucas <lucasbaile14>
Component: systemd
Assignee: systemd-maint
Status: CLOSED DUPLICATE
QA Contact: Frantisek Sumsal <fsumsal>
Severity: unspecified
Priority: unspecified
Version: CentOS Stream
CC: bstinson, jwboyer, mosvald, systemd-maint-list
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-04-25 09:15:03 UTC
Type: Bug

Attachments:
Logs from the fresh LXD container

Description Lucas 2023-04-25 07:56:29 UTC
Created attachment 1959737 [details]
Logs from the fresh LXD container

Description of problem:

We use CentOS Stream 8 images on both LXD containers and cloud VMs (mostly Hetzner in our case), and the current base image seems to have an issue where it takes up to 2 minutes to finish network startup. The issue appears related to dhclient startup, but I might be wrong.

I've attached the full logs from a fresh LXD container startup, where the issue can be seen clearly. The container boots at 15:52:50, the logs stop at 15:52:51, and they only resume at 15:54:21; startup finishes around 15:54:23, when the container finally has an accessible IP and the network is working.

To add a bit more context: we also use the CentOS Stream 8 podman images (quay.io/centos/centos:stream8), and those do not experience the same issue.

Version-Release number of selected component (if applicable):


How reproducible:

It happens 100% of the time I create LXD containers.


Steps to Reproduce:
The behavior is the same for our LXD containers and our cloud VMs. Just start a fresh container from a CentOS Stream 8 base image and it takes around 1-2 minutes to bring up its network.

For the LXD scenario, the image being used is `images:centos/8-Stream`, like this:
```
lxc launch images:centos/8-Stream test-lxd
```

Actual results:

The network service takes 2 minutes on average to start up and get an IP.


Expected results:

In previous versions (and other similar images) this usually happens within a few seconds.

Comment 1 Martin Osvald 🛹 2023-04-25 09:15:03 UTC
Thank you for reporting this problem.

Checking the attached log, at first glance there is no apparent issue with dhclient itself or on the DHCP server side (getting from DHCPDISCOVER to DHCPACK is very fast):

~~~
Apr 20 15:54:21 test-lxd network[467]: Bringing up interface eth0:
Apr 20 15:54:21 test-lxd dhclient[630]: Created duid "\000\004\337\252\266\0058}C\201\204}L\255\204\034SK".
Apr 20 15:54:21 test-lxd dhclient[630]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0xca1db832)
Apr 20 15:54:21 test-lxd dhclient[630]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xca1db832)
Apr 20 15:54:21 test-lxd dhclient[630]: DHCPOFFER from 10.155.213.1
Apr 20 15:54:21 test-lxd dhclient[630]: DHCPACK from 10.155.213.1 (xid=0xca1db832)
Apr 20 15:54:23 test-lxd NET[658]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Apr 20 15:54:23 test-lxd dbus-daemon[468]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.8' (uid=0 pid=660 comm="hostnamectl --transient " label="unconfined")
Apr 20 15:54:23 test-lxd systemd[1]: Starting Hostname Service...
Apr 20 15:54:23 test-lxd dbus-daemon[468]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 20 15:54:23 test-lxd systemd[1]: Started Hostname Service.
Apr 20 15:54:23 test-lxd dhclient[630]: bound to 10.155.213.114 -- renewal in 1754 seconds.
Apr 20 15:54:23 test-lxd network[611]: Determining IP information for eth0... done.
Apr 20 15:54:23 test-lxd network[467]: [  OK  ]
~~~

On the other hand, the delay seems to arise earlier, here:

~~~
Apr 20 15:52:51 test-lxd systemd-udevd[437]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apr 20 15:52:51 test-lxd systemd[1]: Started udev Coldplug all Devices.
Apr 20 15:52:51 test-lxd systemd-udevd[438]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

... note the delay happens here:

Apr 20 15:54:21 test-lxd systemd[1]: systemd-journal-flush.service: start operation timed out. Terminating.
Apr 20 15:54:21 test-lxd systemd[1]: systemd-journal-flush.service: Main process exited, code=killed, status=15/TERM
Apr 20 15:54:21 test-lxd systemd[1]: systemd-journal-flush.service: Failed with result 'timeout'.
~~~
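The size of the stall can be read straight off the journal timestamps above; a small sketch (using the two timestamps from the attached log, with the year assumed to be 2023 since the journal does not record it) shows the gap matches systemd's default 90-second start timeout:

```python
from datetime import datetime

# Timestamps taken from the attached journal excerpt; the year is not
# logged, so 2023 is assumed here purely for parsing.
last_before_stall = datetime.strptime("2023 Apr 20 15:52:51", "%Y %b %d %H:%M:%S")
first_after_stall = datetime.strptime("2023 Apr 20 15:54:21", "%Y %b %d %H:%M:%S")

gap = (first_after_stall - last_before_stall).total_seconds()
print(f"stall: {gap:.0f}s")  # → stall: 90s
```

That 90 s is exactly systemd's default TimeoutStartSec, which is consistent with the "start operation timed out" message terminating systemd-journal-flush.service.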

I found a similar problem on the net:

https://unix.stackexchange.com/questions/740162/systemd-journal-flush-service-failed-with-result-timeout-failed-to-start-flush
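For reference, the class of workaround discussed in such threads is typically a drop-in that shortens the start timeout for systemd-journal-flush.service, so that a hang there no longer delays boot by the full 90 seconds. A sketch of such a drop-in (the 15 s value is an illustrative assumption, not a recommendation from the linked BZs):

```ini
# /etc/systemd/system/systemd-journal-flush.service.d/timeout.conf
# Illustrative only: caps how long systemd waits for the journal flush
# before giving up, instead of the default 90 s.
[Service]
TimeoutStartSec=15s
```

After creating the drop-in, `systemctl daemon-reload` applies it. Whether this mitigation is appropriate depends on the root cause tracked in the BZs below.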

There are already open BZs for this for RHEL7 (WONTFIX) and RHEL8:

RHEL7 BZ:

[Bug 1649857 - systemd-journal-flush.service fails in timeout causing boot to be delayed by 1min30]
https://bugzilla.redhat.com/show_bug.cgi?id=1649857

RHEL8 BZ:

[Bug 2174645 - Failed to start Flush Journal to Persistent Storage (systemd-journal-flush.service)]
https://bugzilla.redhat.com/show_bug.cgi?id=2174645

Looking at the state of the RHEL8 BZ, I can see it is targeted to be fixed in RHEL 8.8.

There is also a comment asking about a CentOS 8 fix first:

https://bugzilla.redhat.com/show_bug.cgi?id=2174645#c26


Please refer to the information shared in those links and BZs.

Given the information above, I am changing the component of this bug to systemd and closing it as a duplicate of bug 2174645.

*** This bug has been marked as a duplicate of bug 2174645 ***

Comment 2 Lucas 2023-04-25 11:07:29 UTC
Thanks! I was totally barking up the wrong tree here, since this behavior initially showed up as a networking problem, but I can see now that it occurs at an earlier point.

I'll play around with the workaround from Stack Exchange to see if it's feasible for our flow, and I'll also keep an eye on bug 2174645.

Appreciate the help and fast reply!