Bug 1994804 - Long delay with AWS systems becoming available
Summary: Long delay with AWS systems becoming available
Keywords:
Status: CLOSED DUPLICATE of bug 1862930
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: cloud-init
Version: ---
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: sushil kulkarni
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-17 20:56 UTC by David Valin
Modified: 2023-05-26 19:47 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-18 12:52:02 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
systemd-analyze time output (148 bytes, text/plain), attached 2021-08-17 20:56 UTC by David Valin


Links
Red Hat Issue Tracker RHELPLAN-93897 (last updated 2021-08-17 20:57:51 UTC)

Description David Valin 2021-08-17 20:56:25 UTC
Created attachment 1814954
systemd-analyze time output

Description of problem:
On AWS, a newly created RHEL instance takes about 122 seconds before we can log in via ssh. Ubuntu and Amazon Linux instances take about 55 seconds.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Create an AWS instance with RHEL (we used ami-03d64741867e7bb94, c5.xlarge).
2. Time how long it takes before you can ssh into the system (a timing sketch follows below).
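A minimal timing sketch, assuming the AWS CLI is configured and a key pair named "mykey" exists (both the key name and the polling approach are illustrative assumptions; the report does not say how the time was measured):

    #!/bin/bash
    # Record the start time and launch an instance from the reported AMI.
    START=$(date +%s)
    INSTANCE_ID=$(aws ec2 run-instances --image-id ami-03d64741867e7bb94 \
        --instance-type c5.xlarge --key-name mykey \
        --query 'Instances[0].InstanceId' --output text)
    # Wait for the instance to enter the running state, then fetch its public IP.
    aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
    IP=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
    # Poll ssh until it accepts a connection, then report the elapsed time.
    until ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no ec2-user@"$IP" true 2>/dev/null; do
        sleep 2
    done
    echo "ssh reachable after $(( $(date +%s) - START )) seconds"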

Actual results:
Time from cloud instance creation to login is about 122 seconds.

Expected results:
Expected to be on par with Ubuntu/Amazon Linux, which take around 55 seconds.

Additional info:

From dmesg files

[   10.003160] ppdev: user-space parallel port driver
[   62.901525] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.910524] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.928081] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   62.940125] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   63.511141] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

That is where the large jump in time occurs. Amazon Linux reaches this point about 4 seconds into boot.

From systemd-analyze blame:
         53.120s cloud-init-local.service
          8.451s kdump.service

So the cloud-init-local.service appears to be the culprit.

Looking at its log file:
2021-08-17 20:26:39,749 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '172.31.0.1', 'dev', 'eth0', 'src', '172.31.4.239'] with allowed return codes [0] (shell=False, capture=True)
2021-08-17 20:26:39,760 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '172.31.0.1', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2021-08-17 20:27:19,832 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.061 seconds
2021-08-17 20:27:29,843 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 10.010 seconds

It appears that URL resolution is what is causing the delay.
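The 40 s and 10 s stalls are consistent with DNS timeouts against a nameserver that is unreachable from the new instance. A quick check on a booted instance (a hedged sketch; that /etc/resolv.conf carries the stale resolver is confirmed later in this bug):

    # Inspect the resolver configuration that was baked into the image.
    cat /etc/resolv.conf
    # Time a lookup of the same name cloud-init tries to resolve; a wall
    # time of many seconds means the configured nameserver is timing out.
    time getent hosts instance-data.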

Log files attached

Comment 8 Amnon Ilan 2021-08-18 12:52:02 UTC

*** This bug has been marked as a duplicate of bug 1862930 ***

Comment 9 Aaron.Boudreaux 2022-03-03 15:17:06 UTC
Bug 1862930 is an internal bug whose status I cannot view. Is there any status on its resolution, or any workarounds mentioned? We are still experiencing this issue with RHEL 8.5.

Comment 10 Frank Liang 2022-03-04 01:28:15 UTC
(In reply to aboudr01 from comment #9)
> Bug 1862930 is an internal bug whose status I cannot view. Is there any
> status on its resolution, or any workarounds mentioned? We are still
> experiencing this issue with RHEL 8.5.
Are you using images you built yourself, or images created from running instances?
Does the delay appear only on the first launch? If so, please truncate the file below before uploading your own image or creating one from a running instance:
# truncate -s 0 /etc/resolv.conf
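For context, a hedged sketch of where this step fits during image preparation (the likely cause, a resolv.conf carried over from the build network whose nameserver is unreachable in the new VPC, is an inference from the log above; the exact sequence is an assumption):

    # As root on the instance that will be captured as an image:
    truncate -s 0 /etc/resolv.conf   # drop resolver entries tied to the old network
    shutdown -h now                  # stop the instance, then create the AMI from it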

Comment 11 Aaron.Boudreaux 2022-03-04 15:11:22 UTC
We are using a Red Hat-provided 8.3 AMI in the C2S region, which points to a yum repo updated with 8.5 rpms. It appears to be a problem only on first boot. After stopping and starting the instance, cloud-init analyze blame shows, for the first boot record:

-- Boot Record 01 --
51.03100s (init-local/search-Ec2Local)

and the second

-- Boot Record 02 --
00.65400s (init-local/search-Ec2Local)

We are also creating our own images, starting from the Red Hat-provided one, and they have the same issue. I will try the suggested workaround of clearing resolv.conf to see whether it resolves the issue for our own images.
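The per-boot numbers above come from cloud-init's built-in boot analyzer, which can be run on any instance where cloud-init has completed:

    # Print per-stage timings for each recorded boot, slowest stages first.
    cloud-init analyze blame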

Comment 12 Aaron.Boudreaux 2022-03-04 17:24:16 UTC
I have confirmed that clearing /etc/resolv.conf before making an image resolves the problem.

Thanks,
Aaron

Comment 13 Frank Liang 2022-03-07 01:00:19 UTC
(In reply to aboudr01 from comment #12)
> I have confirmed that clearing /etc/resolv.conf before making an image
> resolves the problem.
> 
> Thanks,
> Aaron

Thanks for the confirmation. We are working on an article about this topic.

Comment 14 Aaron.Boudreaux 2023-05-26 19:47:10 UTC
We are now using RHEL 8.7 with NetworkManager, and clearing /etc/resolv.conf with `truncate -s 0 /etc/resolv.conf` before AMI creation no longer resolves the issue: the file gets repopulated between the time it is cleared and the time the instance is shut down to create the AMI. Is there a fix that does not require /etc/resolv.conf to be empty? I suspect NetworkManager is rewriting the file; is there a recommended strategy for preparing a host for AMI creation?
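NetworkManager does rewrite /etc/resolv.conf by default. A hedged sketch of one way to keep the truncated file empty until shutdown, using NetworkManager's standard dns=none option (a plausible approach, not a fix confirmed in this bug):

    # Tell NetworkManager to stop managing /etc/resolv.conf.
    cat > /etc/NetworkManager/conf.d/90-dns-none.conf <<'EOF'
    [main]
    dns=none
    EOF
    systemctl reload NetworkManager
    # The cleared file now stays empty through shutdown and AMI creation.
    truncate -s 0 /etc/resolv.conf
    # Remove the drop-in on first boot of new instances if NetworkManager
    # should resume managing the resolver configuration there.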

