Bug 1780572

Summary: S3 region invalid if ec2-metadata fails on first attempt
Product: OpenShift Container Platform Reporter: Simon Reber <sreber>
Component: RHCOS Assignee: Micah Abbott <miabbott>
Status: CLOSED ERRATA QA Contact: Michael Nguyen <mnguyen>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.2.z CC: bbreard, behoward, dornelas, dustymabe, imcleod, jligon, miabbott, nstielau, walters
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1781383 1784475 Environment:
Last Closed: 2020-05-04 11:19:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1186913, 1784475    

Description Simon Reber 2019-12-06 12:51:51 UTC
Description of problem:

The bootstrap node fails to download the Ignition fragment from S3, even though the S3 eu-central-1 prefix lists are accessible. The logs show that Ignition tries to reach S3 in us-east-1, which fails because access to that network is blocked by the VPC.

> Bootstrap node is in a eu-central-1 VPC
> rhcos-42.80.20191002.0-hvm (ami-092b69120ecf915ed)

Ignition snippet from user_data
> {"ignition":{"config":{"append":[{"source":"s3://foo-eu-central-1-bar-ocp-cluster-management/bootstrap.ign","verification":{}}]},
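Note that the s3:// source above names only a bucket and an object key. A minimal Python sketch of splitting it (the helper name `split_s3_source` is hypothetical, not part of Ignition) shows that the URL has no region field; the "eu-central-1" text is just part of the bucket name, so the client has to learn the region some other way:

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_source(source: str) -> Tuple[str, str]:
    """Split an s3:// URL into (bucket, key). Hypothetical helper."""
    parts = urlsplit(source)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {source}")
    # netloc is the bucket name; the path (minus its leading slash) is the key.
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3_source(
    "s3://foo-eu-central-1-bar-ocp-cluster-management/bootstrap.ign"
)
print(bucket)  # bucket name only; no separate region component exists
print(key)
```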

Logs found
> [  127.326019] ignition[636]: Ignition failed: RequestError: send request failed
> [  127.326019] caused by: dial tcp 52.217.10.22:443: i/o timeout
> [FAILED] Failed to start Ignition (disks)
> [  127.339017] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE

When an S3 source is used and the network is not yet ready for the first metadata retrieval, the regionHint is nil, so the region defaults to us-east-1. See https://github.com/coreos/ignition/blob/v0.33.0/internal/providers/ec2/ec2.go#L62
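
The failure mode described above can be illustrated with a small Python sketch. This is not Ignition's code (Ignition is written in Go), and it does not claim to reproduce the upstream fix. The names `region_hint_naive`, `region_hint_with_retry`, `make_flaky_fetcher`, and the retry parameters are all hypothetical. The only assumption taken from this report is that a failed first metadata lookup leaves the hint empty and the region falls back to us-east-1.

```python
import time
from typing import Callable, Optional

DEFAULT_REGION = "us-east-1"  # fallback region named in this report


def region_hint_naive(fetch_region: Callable[[], Optional[str]]) -> str:
    """Single attempt: any failure silently yields the default region."""
    try:
        region = fetch_region()
    except OSError:
        region = None
    return region or DEFAULT_REGION


def region_hint_with_retry(
    fetch_region: Callable[[], Optional[str]],
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> str:
    """Retry the metadata lookup before falling back to the default."""
    for _ in range(attempts):
        try:
            region = fetch_region()
        except OSError:
            region = None
        if region:
            return region
        time.sleep(delay_seconds)
    return DEFAULT_REGION


def make_flaky_fetcher(failures: int, region: str) -> Callable[[], Optional[str]]:
    """Hypothetical metadata client: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch() -> Optional[str]:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise OSError("network not ready")
        return region

    return fetch
```

With a fetcher that fails once, `region_hint_naive(make_flaky_fetcher(1, "eu-central-1"))` returns "us-east-1", while `region_hint_with_retry(make_flaky_fetcher(1, "eu-central-1"))` returns "eu-central-1". In a VPC that only allows the eu-central-1 S3 prefix lists, the first result would lead to the timeout shown in the logs.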

Version-Release number of selected component (if applicable):

RHEL Core OS 42.80.20191002.0

How reproducible:

Always

Steps to Reproduce:
1. Restrict S3 access to S3 eu-central-1 only and bootstrap may fail

Actual results:

Bootstrap is failing

Expected results:

Bootstrap to work

Additional info:

Likely the fix from https://github.com/coreos/ignition/pull/830 is needed

Comment 2 Micah Abbott 2019-12-06 14:06:46 UTC
We should produce a newer version of Ignition for RHCOS 4.2 that has that PR, as well as other fixes that have landed.  However, the tricky part is that the new Ignition would need to also be included in the boot images that are used to initially provision the bootstrap node and the rest of the nodes.

We currently do not have a good mechanism for doing that, so just noting here for visibility.

@behoward this seems like something the Tools team could handle:  the rebuild of Ignition for 4.2.

Comment 3 Colin Walters 2019-12-06 14:22:08 UTC
Right, it would need new bootimages, which so far we haven't done.  I'm not sure this warrants it versus getting 4.3 out.

Comment 7 Andrew Jeddeloh 2019-12-11 21:23:14 UTC
I've tagged Ignition v0.34.0 with the fix needed. Not sure what needs to happen to get it into the relevant RHCOS.

Comment 12 Michael Nguyen 2019-12-18 01:25:32 UTC
Verified 4.4 nightly has the correct version of ignition

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-16-124946   True        False         4h11m   Cluster version is 4.4.0-0.nightly-2019-12-16-124946
$ oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-130-59.ec2.internal    Ready    master   4h30m   v1.16.2
ip-10-0-133-136.ec2.internal   Ready    worker   4h20m   v1.16.2
ip-10-0-154-85.ec2.internal    Ready    worker   4h21m   v1.16.2
ip-10-0-157-203.ec2.internal   Ready    master   4h31m   v1.16.2
ip-10-0-160-63.ec2.internal    Ready    worker   4h21m   v1.16.2
ip-10-0-163-177.ec2.internal   Ready    master   4h31m   v1.16.2
$ oc debug node/ip-10-0-130-59.ec2.internal
Starting pod/ip-10-0-130-59ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	ostree	proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# rpm -q ignition
ignition-0.34.0-0.rhaos4.3.git92f874c.el8.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
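
The manual check above (reading the `rpm -q ignition` output) could be scripted. A minimal Python sketch follows, assuming the package string has the `ignition-<major>.<minor>.<patch>-<release>` shape seen in this transcript and that v0.34.0, the version tagged in Comment 7, is the first release with the fix. The function names are hypothetical.

```python
import re
from typing import Tuple

FIXED_VERSION = (0, 34, 0)  # Ignition release tagged with the fix (Comment 7)


def ignition_version(rpm_query_output: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from `rpm -q ignition` output."""
    match = re.match(r"ignition-(\d+)\.(\d+)\.(\d+)-", rpm_query_output.strip())
    if match is None:
        raise ValueError(f"unexpected rpm output: {rpm_query_output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def has_region_fix(rpm_query_output: str) -> bool:
    """True if the installed Ignition is at least FIXED_VERSION."""
    return ignition_version(rpm_query_output) >= FIXED_VERSION


print(has_region_fix("ignition-0.34.0-0.rhaos4.3.git92f874c.el8.x86_64"))
```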

Comment 14 errata-xmlrpc 2020-05-04 11:19:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 15 Red Hat Bugzilla 2023-09-18 00:19:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days