Bug 1994804

Summary: Long delay with AWS systems becoming available
Product: Red Hat Enterprise Linux 8 Reporter: David Valin <dvalin>
Component: cloud-initAssignee: sushil kulkarni <sukulkar>
Status: CLOSED DUPLICATE QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: unspecified    
Version: ---CC: Aaron.Boudreaux, ailan, huzhao, jgreguske, linl, mhayden, ribarry, virt-bugs, xiachen, xiliang, yacao
Target Milestone: beta   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-08-18 12:52:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
systemd analyze-time output none

Description David Valin 2021-08-17 20:56:25 UTC
Created attachment 1814954 [details]
systemd  analyze-time output

Description of problem:
On AWS, we are seeing on system creation, taking 122 seconds to be able to login to the system via ssh.  Ubuntu and Amazon linux is taking about 55 seconds

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1.Create an AWS instance with RHEL (was using ami-03d64741867e7bb94, c5.xlarge instance)
2.Time how long it takes before you can ssh into the system
3.

Actual results:
Time from cloud instance creation start to logging in, is taking about 122 seconds.

Expected results:
Expect to be on par with Ubuntu/Amazon Linux, which is around 55 seconds.

Additional info:

From dmesg files

[   10.003160] ppdev: user-space parallel port driver
[   62.901525] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.910524] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.928081] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.940125] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   63.511141] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

That is the large jump in time.  Amazon Linux shows this about 4 seconds into the run.

From 
systemd analyze-blame
         53.120s cloud-init-local.service
          8.451s kdump.service

So the cloud-init-local.service appears to be the culprit.

Looking at its log file 
2021-08-17 20:26:39,749 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '172.31.0.1', 'dev', 'eth0', 'src', '172.31.4.239'] with allowed return codes [0] (shell=False, capture=True)
2021-08-17 20:26:39,760 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '172.31.0.1', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2021-08-17 20:27:19,832 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.061 seconds
2021-08-17 20:27:29,843 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 10.010 seconds

Appears that the URL resolution is what is causing the problems.

Log files attached

Comment 8 Amnon Ilan 2021-08-18 12:52:02 UTC

*** This bug has been marked as a duplicate of bug 1862930 ***

Comment 9 Aaron.Boudreaux 2022-03-03 15:17:06 UTC
bug 1862930 is a internal bug that I cannot view the status on. Is there any status on the resolution of it or any work arounds mentioned? We are still experiencing this issue with rhel 8.5.

Comment 10 Frank Liang 2022-03-04 01:28:15 UTC
(In reply to aboudr01 from comment #9)
> bug 1862930 is a internal bug that I cannot view the status on. Is there any
> status on the resolution of it or any work arounds mentioned? We are still
> experiencing this issue with rhel 8.5.
Are you using the images built by your self or created from running instances?
Does it appear only in your first launch? If yes, please try to cleanup below file before uploading your own image or creating from running instance.
#truncate -s 0 /etc/resolv.conf

Comment 11 Aaron.Boudreaux 2022-03-04 15:11:22 UTC
We are using a redhat provided 8.3 AMI in the c2s region which points to a yum repo updated with 8.5 rpms. It appears to only be a problem with the first boot. After stopping and starting the instance, when I run cloud-init analyze blame the first boot record shows

-- Boot Record 01 --
51.03100s (init-local/search-Ec2Local)

and the second

-- Boot Record 02 --
00.65400s (init-local/search-Ec2Local)

We are also creating our own images, starting from the redhat provided one, which are also having the issue. I will try the suggested work around of clearing resolv.conf to see if it resolves the issue for our own images.

Comment 12 Aaron.Boudreaux 2022-03-04 17:24:16 UTC
I have confirmed that clearing /etc/resolv.conf before making an image resolves the problem.

Thanks,
Aaron

Comment 13 Frank Liang 2022-03-07 01:00:19 UTC
(In reply to aboudr01 from comment #12)
> I have confirmed that clearing /etc/resolv.conf before making an image
> resolves the problem.
> 
> Thanks,
> Aaron

Thanks for your confirmation, we are working on writing an article about this topic.

Comment 14 Aaron.Boudreaux 2023-05-26 19:47:10 UTC
We are now using rhel 8.7 with NetworkManager, and clearing the /etc/resolv.conf with `truncate -s 0 /etc/resolv.conf` before AMI creation is no longer resolving the issue because the /etc/resolv.conf gets repopulated between the time of clearing the file and shutting down the instance in order to create the AMI. Is there a fix that can be made that does not require needing /etc/resolv.conf to be clear? I suspect NetworkManager is reconfiguring the file; is there a recommended strategy for preparing a host for AMI creation?