Bug 1780572 - S3 region invalid if ec2-metadata fails on first attempt [NEEDINFO]
Summary: S3 region invalid if ec2-metadata fails on first attempt
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.2.z
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1784475
 
Reported: 2019-12-06 12:51 UTC by Simon Reber
Modified: 2020-07-21 16:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1781383 1784475
Environment:
Last Closed: 2020-05-04 11:19:04 UTC
Target Upstream Version:
miabbott: needinfo? (bbreard)



Links
System                              ID              Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)   4641171         0        None      None    None     2019-12-06 12:55:12 UTC
Red Hat Product Errata              RHBA-2020:0581  0        None      None    None     2020-05-04 11:19:37 UTC

Description Simon Reber 2019-12-06 12:51:51 UTC
Description of problem:

The bootstrap node fails to download the Ignition config from S3, even though the S3 prefix list for eu-central-1 is reachable. The logs show that Ignition tries to reach S3 in us-east-1; this fails because the VPC blocks access to that network.

> Bootstrap node is in a eu-central-1 VPC
> rhcos-42.80.20191002.0-hvm (ami-092b69120ecf915ed)

Ignition snippet from user_data
> {"ignition":{"config":{"append":[{"source":"s3://foo-eu-central-1-bar-ocp-cluster-management/bootstrap.ign","verification":{}}]},
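Note that the s3:// source above carries only a bucket name and an object key. The region string inside the bucket name is just part of the name, so the client has to learn the region some other way. A minimal Go sketch of that split, using the standard net/url package (splitS3URL is a hypothetical helper written for illustration, not Ignition's code):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitS3URL is a hypothetical helper: it breaks an s3:// URL into its
// bucket and object key. Nothing in the URL identifies an AWS region.
func splitS3URL(raw string) (bucket, key string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "s3" {
		return "", "", fmt.Errorf("unexpected scheme %q, want s3", u.Scheme)
	}
	// The host part is the bucket; the path (minus the leading slash) is the key.
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	bucket, key, err := splitS3URL("s3://foo-eu-central-1-bar-ocp-cluster-management/bootstrap.ign")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bucket:", bucket)
	fmt.Println("key:", key)
}
```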

Logs found
> [  127.326019] ignition[636]: Ignition failed: RequestError: send request failed
> [  127.326019] caused by: dial tcp 52.217.10.22:443: i/o timeout
> [FAILED] Failed to start Ignition (disks)
> [  127.339017] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE

When an S3 source is used and the network is not yet ready for the first metadata retrieval, the regionHint is nil, so the region defaults to us-east-1 in https://github.com/coreos/ignition/blob/v0.33.0/internal/providers/ec2/ec2.go#L62

Version-Release number of selected component (if applicable):

RHEL Core OS 42.80.20191002.0

How reproducible:

Always

Steps to Reproduce:
1. Restrict S3 access from the VPC to S3 in eu-central-1 only, then start the bootstrap node; the bootstrap may fail.

Actual results:

Bootstrap fails because Ignition cannot fetch the config from S3.

Expected results:

Bootstrap succeeds.

Additional info:

Likely the fix from https://github.com/coreos/ignition/pull/830 is needed

Comment 2 Micah Abbott 2019-12-06 14:06:46 UTC
We should produce a newer version of Ignition for RHCOS 4.2 that has that PR, as well as other fixes that have landed.  However, the tricky part is that the new Ignition would need to also be included in the boot images that are used to initially provision the bootstrap node and the rest of the nodes.

We currently do not have a good mechanism for doing that, so just noting here for visibility.

@behoward this seems like something the Tools team could handle:  the rebuild of Ignition for 4.2.

Comment 3 Colin Walters 2019-12-06 14:22:08 UTC
Right, it would need new bootimages, which so far we haven't done.  I'm not sure this warrants it versus getting 4.3 out.

Comment 7 Andrew Jeddeloh 2019-12-11 21:23:14 UTC
I've tagged Ignition v0.34.0 with the fix needed. Not sure what needs to happen to get it into the relevant RHCOS.

Comment 12 Michael Nguyen 2019-12-18 01:25:32 UTC
Verified that the 4.4 nightly includes the fixed version of Ignition (0.34.0):

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-16-124946   True        False         4h11m   Cluster version is 4.4.0-0.nightly-2019-12-16-124946
$ oc get nodes
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-130-59.ec2.internal    Ready    master   4h30m   v1.16.2
ip-10-0-133-136.ec2.internal   Ready    worker   4h20m   v1.16.2
ip-10-0-154-85.ec2.internal    Ready    worker   4h21m   v1.16.2
ip-10-0-157-203.ec2.internal   Ready    master   4h31m   v1.16.2
ip-10-0-160-63.ec2.internal    Ready    worker   4h21m   v1.16.2
ip-10-0-163-177.ec2.internal   Ready    master   4h31m   v1.16.2
$ oc debug node/ip-10-0-130-59.ec2.internal
Starting pod/ip-10-0-130-59ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	ostree	proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# rpm -q ignition
ignition-0.34.0-0.rhaos4.3.git92f874c.el8.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

Comment 14 errata-xmlrpc 2020-05-04 11:19:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

