Bug 1892521 - [AWS] Startup bootstrap machine failed due to ignition file is missing in disconnected UPI env
Summary: [AWS] Startup bootstrap machine failed due to ignition file is missing in dis...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.7.0
Assignee: slowrie
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1915617
TreeView+ depends on / blocked
 
Reported: 2020-10-29 03:00 UTC by Yunfei Jiang
Modified: 2021-02-24 15:29 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:29:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
bootstrap machine full system log (64.00 KB, text/plain)
2020-10-29 03:01 UTC, Yunfei Jiang
no flags Details
system log & ignition files & install log (160.70 KB, application/gzip)
2020-10-30 06:52 UTC, Yunfei Jiang
no flags Details
Bug1892521_logs_1030.tgz (587.17 KB, application/gzip)
2020-11-03 02:14 UTC, Yunfei Jiang
no flags Details
Bug1892521_logs_1104.tgz (20.35 KB, application/gzip)
2020-11-04 10:28 UTC, Yunfei Jiang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github coreos ignition pull 1155 0 None closed internal/providers: Run platform Init function before fetching config 2021-02-16 20:53:56 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:47 UTC

Description Yunfei Jiang 2020-10-29 03:00:24 UTC
Install a disconnected UPI cluster on AWS, the bootstrap instance is started on AWS, but can not acceess it via SSH (in bastion host), checking system logs, get following errors, and masters also can not get ignition files from temporary service on bootstrap machine.


bootstrap machine system log:
[    5.148509] dracut-cmdline[344]: Using kernel command line parameters: ip=dhcp,dhcp6 rd.driver.pre=dm_multipath BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws
[    6.891496] ignition-setup-base[614]: File /usr/lib/ignition/platform/aws/base.ign does not exist.. Skipping copy
[    7.016910] ignition-setup-user[617]: File /mnt/boot_partition/ignition/config.ign does not exist.. Skipping copy
         Starting Copy CoreOS Firstboot Networking Config[    7.346070] ignition[651]: Ignition 2.7.0
[    7.354002] ignition[651]: Stage: fetch-offline
[    7.364011] ignition[651]: reading system config file "/usr/lib/ignition/base.ign"
[    7.386808] ignition[651]: no config URL provided
[    7.396501] ignition[651]: reading system config file "/usr/lib/ignition/user.ign"
[    7.421530] ignition[651]: no config at "/usr/lib/ignition/user.ign"
[    7.435414] ignition[651]: failed to fetch config: resource requires networking
[    7.442683] ignition[651]: Ignition finished successfully
[    7.751114] ignition[700]: Ignition 2.7.0
[    7.755161] ignition[700]: Stage: fetch
[    7.761096] ignition[700]: reading system config file "/usr/lib/ignition/base.ign"
[    7.768221] ignition[700]: no config URL provided
[    7.772468] ignition[700]: reading system config file "/usr/lib/ignition/user.ign"
[    7.779085] ignition[700]: no config at "/usr/lib/ignition/user.ign"
[    7.783060] ignition[700]: GET http://169.254.169.254/2019-10-01/user-data: attempt #1
[    7.790100] ignition[700]: GET error: Get "http://169.254.169.254/2019-10-01/user-data": dial tcp 169.254.169.254:80: connect: network is unreachable
[    8.005128] ignition[700]: GET http://169.254.169.254/2019-10-01/user-data: attempt #2
[    8.030362] ignition[700]: GET result: OK
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws


Version-Release number of the following components: 
4.7.0-0.nightly-2020-10-27-051128

How reproducible: 
Always

Steps to Reproduce: 
1. create a disconnected cluster via UPI

Actual results: 
failed to start bootstrap machine and related services.

Expected results: 
start bootstrap machine successfully, and all services work well

Additional info:

Comment 1 Yunfei Jiang 2020-10-29 03:01:50 UTC
Created attachment 1724978 [details]
bootstrap machine full system log

Comment 2 Micah Abbott 2020-10-29 15:43:34 UTC
@yunjiang could you provide the ignition config that the bootstrap host is using?

From the logs, I see


```
[    8.615638] NetworkManager[721]: <info>  [1603852952.3790] policy: set 'Wired Connection' (ens3) as default for IPv4 routing and DNS
[    8.626468] NetworkManager[721]: <info>  [1603852952.3790] policy: set-hostname: set hostname to 'ip-10-0-60-227' (from DHCPv4)
[    8.636940] NetworkManager[721]: <info>  [1603852952.3792] device (ens3): Activation: successful, device activated.
[    8.646249] NetworkManager[721]: <info>  [1603852952.3792] manager: NetworkManager state is now CONNECTED_GLOBAL
[    8.651751] NetworkManager[721]: <info>  [1603852952.3794] manager: startup complete
[    8.659360] NetworkManager[721]: <info>  [1603852952.3795] quitting now that startup is complete
[    8.667754] NetworkManager[721]: <info>  [1603852952.4437] dhcp4 (ens3): canceled DHCP transaction
[    8.676276] NetworkManager[721]: <info>  [1603852952.4438] dhcp4 (ens3): state changed bound -> done
[    8.684998] NetworkManager[721]: <info>  [1603852952.4438] device (ens3): DHCPv4: trying to acquire a new lease within 90 seconds
[    8.695369] NetworkManager[721]: <info>  [1603852952.4439] manager: NetworkManager state is now CONNECTED_SITE
[    8.704536] NetworkManager[721]: <info>  [1603852952.4440] exiting (success)
```

...then


```
[[0m[0;31m*     [0m] A start job is running for Ignition (fetch) (8s / no limit)
[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (9s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Ignition (fetch) (9s / no limit)
[K[ [0;31m*[0;1;31m*[0m[0;31m*  [0m] A start job is running for Ignition (fetch) (10s / no limit)
[K[  [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (10s / no limit)
[K[   [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (11s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (11s / no limit)
[K[     [0;31m*[0m] A start job is running for Ignition (fetch) (12s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (12s / no limit)
```

So the system appears to have network access, but it is spinning endlessly trying to fetch some part of the Ignition config.

@slowrie could you investigate/triage this more thoroughly?

Comment 3 Yunfei Jiang 2020-10-30 06:52:32 UTC
Created attachment 1725241 [details]
system log & ignition files & install log

I reproduced the issue, please check attached Bug1892521_logs_1030.tgz for system log & ignition files & install log.

Comment 4 slowrie 2020-10-30 15:43:49 UTC
I think the loop is a red-herring (likely it's just spewed out output from when Ignition was waiting for NetworkManager to finish) and that the real error is probably occurring after the console logs shown. The AWS `get-console-output` command only returns up to 64KB (https://docs.aws.amazon.com/cli/latest/reference/ec2/get-console-output.html#description) and the current console logs happen to also be 64KB and end in a weird spot.

If possible can you grab the console output 5 minutes after the machine has failed / been killed (`aws ec2 get-console-output -i $instance_id --latest`)?

Comment 5 Yunfei Jiang 2020-11-03 02:14:34 UTC
Created attachment 1726102 [details]
Bug1892521_logs_1030.tgz

Comment 6 Yunfei Jiang 2020-11-03 02:41:23 UTC
please check latest logs in Bug1892521_logs_1030.tgz.

I collect system console by using `aws ec2 get-console-output -i $instance_id > $log_file` every 30 seconds after bootstrap instance id is available, and last 2 hours. (means still get console logs 90 mins after job failed.)

I removed the duplicate logs file (checking via md5sum).

EC2 console does not support `--latest` option:
>> You can optionally retrieve the latest serial console output at any time during the instance lifecycle. This option is only supported on Instances built on the Nitro System. It is not supported through the Amazon EC2 console.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html

Comment 7 slowrie 2020-11-03 17:33:16 UTC
Looking through the new set of logs nothing is jumping out that this is Ignition related. Ignition does seem to be fetching the config from the metadata service once NetworkManager gets the networking up (Ignition has built-in retries which is the logs in the initial post) as seen in this chunk:

[   10.341853] ignition[713]: GET http://169.254.169.254/2019-10-01/user-data: attempt #3
[   10.348561] NetworkManager[735]: <info>  [1604310826.9742] dhcp4 (ens3): option requested_broadcast_address => '1'
[   10.356673] NetworkManager[735]: <info>  [1604310826.9742] dhcp4 (ens3): option requested_domain_name => '1'
[   10.365086] NetworkManager[735]: <info>  [1604310826.9742] dhcp4 (ens3): option requested_domain_name_servers => '1'
[   10.371674] ignition[713]: GET result: OK

The machine seems to be doing a reboot after the fetch stage but I'm not sure why that is and after it comes back Ignition doesn't seem to be being ran at all before we hit the emergency shell. AWS console logs are generally rather mangled so it's hard to have an accurate timeline from them.

If possible can you provide a more compact reproducer so we can do additional debugging?

Comment 8 Yunfei Jiang 2020-11-04 10:27:39 UTC
Creating a disconnected cluster with proxy enabled:

1. create VPC by using 01_vpc_disconnected_aws_with_privatelink.yaml (all CF templates are attached in Bug1892521_logs_1104.tgz)
2. create a bastion host with http proxy in public subnet
3. create install-config.yaml with http proxy
4. create CF stack 02_cluster_infra.yaml
5. create CF stack 03_cluster_security.yaml
6. create CF stack 04_cluster_bootstrap_disconnected.yaml

* the whole jenkins output is in the file jenkins_output.txt in Bug1892521_logs_1104.tgz

Comment 9 Yunfei Jiang 2020-11-04 10:28:15 UTC
Created attachment 1726524 [details]
Bug1892521_logs_1104.tgz

Comment 10 Micah Abbott 2020-11-04 19:25:20 UTC
We'll target this as part of 4.7, but we are currently working on higher priority items related to 4.7 features.

Is it possible to reproduce this with just a single RHCOS node + Ignition config?  Or do you think it requires all the setup for a disconnected environment?

Comment 11 Yunfei Jiang 2020-11-05 02:30:22 UTC
(In reply to Micah Abbott from comment #10)
> We'll target this as part of 4.7, but we are currently working on higher
> priority items related to 4.7 features.
> 
> Is it possible to reproduce this with just a single RHCOS node + Ignition
> config?  Or do you think it requires all the setup for a disconnected
> environment?

I think disconnected env is required, since the normal UPI works well.

Comment 13 Yunfei Jiang 2020-11-26 10:22:49 UTC
QE did some research on this issue:

>> Test 1
Normal UPI, access ignition file via s3 URI (S3://) (tested, works well)
>> Test 2
Disconnect UPI, access ignition file via s3 URI (s3://), since s3 endpoint is enabled, it should get ignition file, but it doesn't (this bug)
>> Test 3
Disconnect UPI, access ignition file via s3 pre-signed url (https://) (tested, works well)


OCP 4.6 could handle s3 URI (s3://) correctly in disconnected environment, so there is no such issue on 4.6 (Test 2 got passed on 4.6), seems some settings or processes were changed in rhcos 4.7 image.

A good option is, s3 URI could be handled correctly by rhcos 4.7 in disconnect environment, same as 4.6, otherwise, we need to document this issue, using pre-signed url instead of s3 URI in disconnected environment.
 
Since we get a workaround for this issue by using pre-signed url, remove `TestBlocker` keyword.

Comment 14 Micah Abbott 2020-12-05 15:52:35 UTC
Higher priority work related to 4.7 features prevented this from being worked on; setting UpcomingSprint

Comment 15 Benjamin Gilbert 2021-01-08 10:52:25 UTC
Setting No Doc Update because this is a 4.7 regression per comment 13.

Comment 17 Yunfei Jiang 2021-01-21 05:13:28 UTC
from OpenShift installer side, verified and pass.

OCP version: 4.7.0-0.nightly-2021-01-19-095812
RHCOS version: 47.83.202101161239-0

cluster was installed successfully.

Comment 18 Michael Nguyen 2021-01-22 13:44:25 UTC
Closing as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1892521#c17

Comment 21 errata-xmlrpc 2021-02-24 15:29:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.