Bug 2163657

Summary:

[RHEL-8]Launching EC2 Instance in IPv6-Only subnet leads to unreachable instance

Product:

Red Hat Enterprise Linux 8

Reporter:

Neil Hanlon <neil>

Component:

cloud-init

Assignee:

Ani Sinha <anisinha>

Status:

CLOSED MIGRATED

QA Contact:

xiachen

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CentOS Stream

CC:

andavis, anisinha, b2b-redhat-augustineas-2011, bstinson, carl, eesposit, eterrell, huzhao, jgreguske, jwboyer, riehecky, toracat, xiliang, yacao

Target Milestone:

Keywords:

MigratedToJIRA, Triaged

Target Release:

---

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

2227771 (view as bug list)

Environment:

Last Closed:

2023-09-22 15:44:28 UTC

Type:

Bug

Bug Depends On:

2172811

Bug Blocks:

2227771

Attachments:

Description	Flags
backport of EC2 IPv6 fix & tests to EL8	none

Description Neil Hanlon 2023-01-24 06:08:38 UTC

Description of problem:

Launching a CentOS Stream/RHEL machine in AWS into an IPv6-Only subnet results in an unusable instance due to missing ipv4 routes.

Support was added in cloud-init by this commit: https://github.com/canonical/cloud-init/pull/1160/

I rebased the patch against EL8 here: https://git.rockylinux.org/sig/cloud/patch/cloud-init/-/blob/r8/ROCKY/_supporting/9998-Add-Ec2-IPV6-IMDS.patch 

There were three total chunks which failed to apply cleanly; two of them were in tests and trivial to fix. One patch for cloudinit/url_helper.py needed to be rebased slightly to match the EL8 source.

Version-Release number of selected component (if applicable): 22.1-5.el8


How reproducible:
Always

Steps to Reproduce:

1) Create a VPC with an IPv6 CIDR block (using either your own or Amazon's IPv6 address space)
2) Create an IPv6 only subnet by creating a new subnet and checking the "IPv6 Only" box
3) Create a Rocky Linux instance and associate it with the IPv6 capable VPC and the IPv6-only subnet.
4) After approximately 10 minutes, the instance will complete the boot process, but will have "1/2 checks passed" in the "Status Check" column, and "Instance reachability check failed" in the "Status Check" tab of the instance details section.
5) The box will not be connected to the network

Actual results:

System reports the following during boot, and is unreachable once cloud-init times out due to the failure.

 [ 12.865186] cloud-init[899]: 2022-09-12 20:05:50,230 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/latest/api/token' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe4404f8128>: Failed to establish a new connection: [Errno 101] Network is unreachable',))]
[ 13.648505] cloud-init[899]: 2022-09-12 20:05:51,234 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/latest/api/token' failed [1/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe4404f8a20>: Failed to establish a new connection: [Errno 101] Network is unreachable',))]

These messages repeat many times before the final error messages:

[ 131.835862] cloud-init[899]: 2022-09-12 20:07:49,433 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/latest/api/token' failed [119/120s]: unexpected error [Attempted to set connect timeout to 0.0, but the timeout cannot be set to a value less than or equal to 0.]
[ 138.843417] cloud-init[899]: 2022-09-12 20:07:56,440 - DataSourceEc2.py[WARNING]: IMDS's HTTP endpoint is probably disabled

Expected results:

Cloud-Init connects successfully to the EC2 metadata service and 


Additional info:

Originally filed at https://bugs.rockylinux.org/view.php?id=279 - Verified on Rocky, Alma, CentOS Stream, and RHEL AMIs.

Comment 1 Neil Hanlon 2023-01-24 06:12:08 UTC

This is also present in RHEL/CentOS Stream 9. Should I file another bug against that product as well? I am not sure what the proper procedure is.

Thank you!

Comment 2 Neil Hanlon 2023-02-17 13:37:12 UTC

Hi Huijuan,

While investigating this I found a handful of other bugs which I feel should be addressed in RHEL 8 (and 9).

I rebased cloud-init to version 22.2.2 in a copr [1], and have tested this on Rocky Linux 8 and 9. In addition to the IPv6 issue in this issue, the following problems are also resolved by this rebase:

* cloud-init no longer poisons the /etc/NetworkManager/NetworkManager.conf file owned by the NetworkManager package
* The system behaviour after cloud-init is now consistent with the RHEL docs, which state that keyfiles have priority over ifcfg
* cloud-init no longer generates antique ifcfg files and generates keyfiles instead
* The new cloud-init generated network configuration has priority 0 (the default) making it possible to override the cloud-init configuration using your own keyfiles. (Previously, the generated ifcfg files had maximum priority and couldn't be overridden).

Given this information, I would like to suggest (and request) that cloud-init be rebased to 22.2.2 for RHEL 8 and 9. I understand that RHEL 8 is coming to a point where it may not be possible to perform this rebase. If so, I still think RHEL 9 would benefit from this change.

I am happy to close this ticket in favor of new ones requesting a rebase in RHEL 9, if that would be preferable. Please let me know your thoughts and if there are any further clarifications I can provide.

Best,
Neil

[1] https://copr.fedorainfracloud.org/coprs/neil/cloud-init/

Comment 3 Huijuan Zhao 2023-02-18 02:56:37 UTC

Hi Neil,

Thanks a lot for raising the issue and all the investigations.

Regarding the investigations in comment 2, you are right, rebasing to version 22.2 will change the network config methods, from ifcfg to keyfiles. This change comes from patch[1], which was added to cloud-init-22.2
[1] https://github.com/canonical/cloud-init/pull/1224

We backported patch[1] to 22.1, but it caused some additional issues in RHEL; due to release timing and limited resources, we dropped it from 22.1. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=2118235 for details.

We will rebase to 22.2 or a later version in the next release, when resources are available.

For the issue described in comment 0, I am not sure whether we can make this release (rhel-8.8/9.1) even if we only backport patch[2]. We will discuss this within the team ASAP.
[2] https://github.com/canonical/cloud-init/pull/1160/


I also have a question about the test steps/environment for this issue. Could you please help to clarify? Thanks!
In the IPv6-only subnet there is no IPv4 address, right? If so, does the metadata HTTP service (169.254.169.254) have an IPv6 address that the IPv6-only network can reach to get the metadata?

Comment 4 xiachen 2023-02-18 06:48:53 UTC

Sharing some information,

IPv6 endpoints have been available on AWS for the Amazon EC2 Instance Metadata Service since Aug 2021.
https://aws.amazon.com/about-aws/whats-new/2021/08/Ipv6-amazon-ec2-metadata-time-sync-vpc-dns/

About how to retrieve instance metadata, I got more information from AWS docs
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
https://aws.amazon.com/blogs/networking-and-content-delivery/introducing-ipv6-only-subnets-and-ec2-instances/

IPv4
http://169.254.169.254/latest/meta-data/
IPv6
http://[fd00:ec2::254]/latest/meta-data/

As noted in the bug description, support was added in cloud-init by commit: https://github.com/canonical/cloud-init/pull/1160/
cloudinit/sources/DataSourceEc2.py
    metadata_urls = [
        "http://169.254.169.254",
        "http://[fd00:ec2::254]",
        "http://instance-data.:8773",
    ]
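
As a rough illustration of what trying both endpoints looks like from inside an instance, the sketch below requests an IMDSv2 session token from each base URL in turn and reports the first one that answers. It is not cloud-init's implementation: the helper names (`token_request`, `first_reachable`), the 2-second timeout, and the 21600-second TTL are assumptions made for this example. The endpoint addresses and the `/latest/api/token` path appear above and in the boot log; the `X-aws-ec2-metadata-token-ttl-seconds` header is the one AWS documents for IMDSv2.

```python
import urllib.error
import urllib.request

# Base URLs taken from the metadata_urls list quoted above.
METADATA_URLS = [
    "http://169.254.169.254",
    "http://[fd00:ec2::254]",
]


def token_request(base_url, ttl_seconds=21600):
    """Build (but do not send) an IMDSv2 token request for one endpoint."""
    return urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def first_reachable(urls=METADATA_URLS, timeout=2.0):
    """Return the first base URL that hands out a token, or None."""
    for base_url in urls:
        try:
            request = token_request(base_url)
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                if resp.status == 200:
                    return base_url
        except (urllib.error.URLError, OSError):
            # For example "[Errno 101] Network is unreachable" when the
            # instance has no route for that address family.
            continue
    return None
```

Calling `first_reachable()` on an instance tries the IPv4 address first and falls back to the IPv6 one. This sequential fallback is only the simplest possible approach, chosen here for readability.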


In addition,
Support for IPv6 metadata in `DataSourceOpenStack` was added to cloud-init by another commit, which is included in upstream 22.4.1
https://github.com/canonical/cloud-init/pull/1805
- Add openstack IPv6 metadata url `fe80::a9fe:a9fe`
- Enable requesting multiple metadata sources in parallel
This PR is very similar to #1160, reusing the provided `url_helper` logic.
https://bugs.launchpad.net/cloud-init/+bug/1906849

Comment 5 Neil Hanlon 2023-02-20 20:01:03 UTC

(In reply to Huijuan Zhao from comment #3)
> Hi Neil,
> 
> Thanks a lot for raising the issue and all the investigations.
> 
> Regarding the investigations in comment 2, you are right, rebasing to
> version 22.2 will change the network config methods, from ifcfg to keyfiles.
> This change comes from patch[1], which was added to cloud-init-22.2
> [1] https://github.com/canonical/cloud-init/pull/1224
> 
> We backported the patch[1] to 22.1, but there were some additional issues in
> rhel produced by patch[1], due to the release time and resource limit, we
> dropped it from 22.1. Please refer to
> https://bugzilla.redhat.com/show_bug.cgi?id=2118235 for detail information.

That issue is restricted, but I understand the scenario and why it was dropped.

> 
> We will rebase to 22.2 or higher version in next release when the resource
> is available. 

Thank you!

> 
> For this issue described in comment 0, not sure if we can catch up with this
> release(rhel-8.8/9.1) if only backport the patch[2]. Will discuss this in
> team ASAP. 
> [2] https://github.com/canonical/cloud-init/pull/1160/
> 

Again, thank you. I appreciate it. As I mentioned in comment 0, I was able to backport patch[2]. I have attached said patch (9998-Add-Ec2-IPV6-IMDS.patch) to this ticket for posterity, and in case it is helpful.

> 
> And for this issue, I have a question about the test steps/environment.
> Could you please help to clarify it? Thanks!
> For the IPv6-Only subnet, there is no ipv4 address, right? If yes, does the
> metadata http service(169.254.169.254) has ipv6 address to make the
> IPv6-Only network reach out it to get the metadata?

Absolutely. In this test scenario, the server has no IPv4 address or route table besides the loopback. This is a relatively new feature[1a] and is probably a fairly rare use case, but it does seem like a good thing to fix (along with the similar OpenStack patch that was mentioned in comment 4 (Thank you!)). So in this case, the current behavior is that cloud-init eventually times out due to `no route to host`, since there is no route installed for ipv4 in this subnet.
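
For anyone who wants to confirm the missing route from a console session on such an instance, here is a minimal sketch. It assumes nothing beyond the Python standard library; the function name `has_route` and the port number are made up for illustration. It relies on the fact that connecting a UDP socket sends no packets and only asks the kernel to pick a route.

```python
import socket


def has_route(address, family):
    """Return True if the kernel can select a route to `address`.

    Connecting a UDP socket transmits nothing; it only makes the kernel
    choose a route and source address, so it fails fast when none exists.
    """
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.connect((address, 80))
        return True
    except OSError:
        # Typically ENETUNREACH ("Network is unreachable") without a route.
        return False
```

Given the `[Errno 101] Network is unreachable` lines in comment 0, `has_route("169.254.169.254", socket.AF_INET)` would be expected to return `False` on an affected instance, while `has_route("fd00:ec2::254", socket.AF_INET6)` shows whether the IPv6 endpoint is routable. The actual results depend on the instance's network configuration.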

There are instructions for setting up such a subnet here: [2b]

Thank you!


[1a] https://aws.amazon.com/blogs/networking-and-content-delivery/introducing-ipv6-only-subnets-and-ec2-instances/
[2b] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-subnets-commands-example-ipv6-v2.html

Comment 6 Neil Hanlon 2023-02-20 20:02:20 UTC

Created attachment 1945360 [details]
backport of EC2 IPv6 fix & tests to EL8

Comment 7 Huijuan Zhao 2023-02-21 01:51:47 UTC

Thanks Neil and Amy for the update/information.

Not sure if we have enough resources/time to backport this patch to rhel-8.8/9.2; there may be some conflicts, as Neil mentioned in comment 0: "There were three total chunks which failed to apply cleanly; two of them were in tests and trivial to fix. One patch for cloudinit/url_helper.py needed to be rebased slightly to match the EL8 source."

Emanuele, could you please help to review and evaluate this issue? Is it possible to backport the patch in rhel-8.8/9.2 or rebase in next release? Thanks!

Comment 8 xiachen 2023-02-21 02:14:50 UTC

Filed another bug, BZ#2171957, to track support for IPv6 metadata in `DataSourceOpenStack`.

Comment 10 Huijuan Zhao 2023-02-22 01:17:16 UTC

(In reply to Emanuele Giuseppe Esposito from comment #9)
> Hi Huijuan, do you think the issue is high priority enough to ask for
> exception+?
> 
> Otherwise we could wait for the next rebase.

Thanks Emanuele, I agree with Neil's comment 5 this is probably a fairly rare use case, IMO it is ok to wait for the next rebase.

Neil, is it OK to fix the issue in the next rebase/release? rhel-8.8/9.2 is entering the exception+ phase, which means high-priority bugs can still be fixed at this point.

Comment 11 Neil Hanlon 2023-02-22 01:55:37 UTC

(In reply to Huijuan Zhao from comment #10)
> (In reply to Emanuele Giuseppe Esposito from comment #9)
> > Hi Huijuan, do you think the issue is high priority enough to ask for
> > exception+?
> > 
> > Otherwise we could wait for the next rebase.
> 
> Thanks Emanuele, I agree with Neil's comment 5 this is probably a fairly
> rare use case, IMO it is ok to wait for the next rebase.
> 
> Neil, is it ok to fix the issue in next rebase/release? As the rhel-8.8/9.2
> is coming to the exception+ phase, it means high priority bugs can still be
> fixed currently.

Thank you Huijuan, Emanuele - 

I think it is acceptable to wait until the next release for this bug (in comment 0 ) related to IPv6 in AWS.

Regarding the other bugs patched in 22.2 mentioned in comment 2 - it would be great if these could be fixed in the current release, but I understand the timelines and deadlines for the release support. I am OK providing an updated cloud-init for my purposes via my COPR repository until 8.8 and 9.2 are released.

Best,
Neil

Comment 12 Huijuan Zhao 2023-02-22 02:37:44 UTC

(In reply to Neil Hanlon from comment #11)
> Thank you Huijuan, Emanuele - 
> 
> I think it is acceptable to wait until the next release for this bug (in
> comment 0 ) related to IPv6 in AWS.
> 
> Regarding the other bugs patched in 22.2 mentioned in comment 2 - it would
> be great if these could be fixed in the current release, but I understand
> the timelines and deadlines for the release support. I am OK providing an
> updated cloud-init for my purposes via my COPR repository until 8.8 and 9.2
> are released.
> 
> Best,
> Neil

Neil, thank you so much for the understanding.

Regarding the other bugs patched in 22.2 mentioned in comment 2 - I think they will also be fixed in the next rebase/release, given the current time and resource limitations. We will update here once the official cloud-init version is available; then you can update your COPR repository.

Emanuele, please correct me if anything is incorrect. Thanks!

Comment 13 Emanuele Giuseppe Esposito 2023-02-22 08:59:34 UTC

Agree with you, thanks!

Comment 14 Adam Augustine 2023-06-15 23:17:26 UTC

Did these fixes get merged into the 9.2 release? I seem to be having similar symptoms with a RHEL 9.2 box in an AWS IPv6-only subnet. It does eventually boot (after minutes), but is unresponsive to pings or SSH. It also fails the "Instance status checks 1/2: Instance reachability check failure".

It is different in that there don't appear to be /any/ cloud-init lines or actual network interface configuration in the system log.

I have cross checked with a similarly configured (at least as far as security groups and ACLs go) Amazon Linux instance and it does boot quickly, passes the reachability check, is pingable, and I can SSH to it.

Comment 15 xiachen 2023-06-19 15:33:33 UTC

Hi Adam, 

The fix was not merged into the 9.2 release; it is planned to be merged into 9.3.

Comment 27 RHEL Program Management 2023-09-22 15:43:07 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 28 RHEL Program Management 2023-09-22 15:44:28 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.
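
The JQL search described in the migration notice above can be turned into a direct link by URL-encoding the query and appending it to the `jql=` parameter. The sketch below is only an illustration: the helper name `jira_search_url` is hypothetical, and it assumes the tracker accepts a percent-encoded query in that position.

```python
from urllib.parse import quote

JIRA_SEARCH = "https://issues.redhat.com/issues/?jql="


def jira_search_url(bz_number):
    """Return a search URL for the Jira issue migrated from a given BZ."""
    jql = f'"Bugzilla Bug" = {bz_number}'
    # safe="" percent-encodes every reserved character, including quotes,
    # spaces, and the equals sign.
    return JIRA_SEARCH + quote(jql, safe="")


print(jira_search_url(2163657))
# prints https://issues.redhat.com/issues/?jql=%22Bugzilla%20Bug%22%20%3D%202163657
```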