Install a disconnected UPI cluster on AWS, the bootstrap instance is started on AWS, but can not acceess it via SSH (in bastion host), checking system logs, get following errors, and masters also can not get ignition files from temporary service on bootstrap machine. bootstrap machine system log: [ 5.148509] dracut-cmdline[344]: Using kernel command line parameters: ip=dhcp,dhcp6 rd.driver.pre=dm_multipath BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws [ 6.891496] ignition-setup-base[614]: File /usr/lib/ignition/platform/aws/base.ign does not exist.. Skipping copy [ 7.016910] ignition-setup-user[617]: File /mnt/boot_partition/ignition/config.ign does not exist.. Skipping copy Starting Copy CoreOS Firstboot Networking Config[ 7.346070] ignition[651]: Ignition 2.7.0 [ 7.354002] ignition[651]: Stage: fetch-offline [ 7.364011] ignition[651]: reading system config file "/usr/lib/ignition/base.ign" [ 7.386808] ignition[651]: no config URL provided [ 7.396501] ignition[651]: reading system config file "/usr/lib/ignition/user.ign" [ 7.421530] ignition[651]: no config at "/usr/lib/ignition/user.ign" [ 7.435414] ignition[651]: failed to fetch config: resource requires networking [ 7.442683] ignition[651]: Ignition finished successfully [ 7.751114] ignition[700]: Ignition 2.7.0 [ 7.755161] ignition[700]: Stage: fetch [ 7.761096] ignition[700]: reading system config file "/usr/lib/ignition/base.ign" [ 7.768221] ignition[700]: no config URL provided [ 7.772468] ignition[700]: reading system config file "/usr/lib/ignition/user.ign" [ 7.779085] ignition[700]: no config at "/usr/lib/ignition/user.ign" [ 7.783060] ignition[700]: GET http://169.254.169.254/2019-10-01/user-data: attempt #1 [ 7.790100] ignition[700]: GET error: Get "http://169.254.169.254/2019-10-01/user-data": dial tcp 169.254.169.254:80: connect: network is unreachable [ 8.005128] ignition[700]: GET http://169.254.169.254/2019-10-01/user-data: attempt #2 [ 8.030362] ignition[700]: GET result: OK [ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws [ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/vmlinuz-4.18.0-193.28.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ignition.firstboot ostree=/ostree/boot.1/rhcos/f187b8152d5fd0ad77be9c5f928babd2297aaa6f6e0190810712eedfc4e17a38/0 ignition.platform.id=aws Version-Release number of the following components: 4.7.0-0.nightly-2020-10-27-051128 How reproducible: Always Steps to Reproduce: 1. create a disconnected cluster via UPI Actual results: failed to start bootstrap machine and related services. Expected results: start bootstrap machine successfully, and all services work well Additional info:
Created attachment 1724978 [details] bootstrap machine full system log
@yunjiang could you provide the ignition config that the bootstrap host is using? From the logs, I see ``` [ 8.615638] NetworkManager[721]: <info> [1603852952.3790] policy: set 'Wired Connection' (ens3) as default for IPv4 routing and DNS [ 8.626468] NetworkManager[721]: <info> [1603852952.3790] policy: set-hostname: set hostname to 'ip-10-0-60-227' (from DHCPv4) [ 8.636940] NetworkManager[721]: <info> [1603852952.3792] device (ens3): Activation: successful, device activated. [ 8.646249] NetworkManager[721]: <info> [1603852952.3792] manager: NetworkManager state is now CONNECTED_GLOBAL [ 8.651751] NetworkManager[721]: <info> [1603852952.3794] manager: startup complete [ 8.659360] NetworkManager[721]: <info> [1603852952.3795] quitting now that startup is complete [ 8.667754] NetworkManager[721]: <info> [1603852952.4437] dhcp4 (ens3): canceled DHCP transaction [ 8.676276] NetworkManager[721]: <info> [1603852952.4438] dhcp4 (ens3): state changed bound -> done [ 8.684998] NetworkManager[721]: <info> [1603852952.4438] device (ens3): DHCPv4: trying to acquire a new lease within 90 seconds [ 8.695369] NetworkManager[721]: <info> [1603852952.4439] manager: NetworkManager state is now CONNECTED_SITE [ 8.704536] NetworkManager[721]: <info> [1603852952.4440] exiting (success) ``` ...then ``` [[0m[0;31m* [0m] A start job is running for Ignition (fetch) (8s / no limit) [K[[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (9s / no limit) [K[[0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (9s / no limit) [K[ [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (10s / no limit) [K[ [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (10s / no limit) [K[ [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (11s / no limit) [K[ [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (11s / no limit) [K[ [0;31m*[0m] A start job is running for Ignition (fetch) (12s / no limit) [K[ [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (12s / no limit) ``` So the system appears to have network access, but it is spinning endlessly trying to fetch some part of the Ignition config. @slowrie could you investigate/triage this more thoroughly?
Created attachment 1725241 [details] system log & ignition files & install log I reproduced the issue, please check attached Bug1892521_logs_1030.tgz for system log & ignition files & install log.
I think the loop is a red-herring (likely it's just spewed out output from when Ignition was waiting for NetworkManager to finish) and that the real error is probably occurring after the console logs shown. The AWS `get-console-output` command only returns up to 64KB (https://docs.aws.amazon.com/cli/latest/reference/ec2/get-console-output.html#description) and the current console logs happen to also be 64KB and end in a weird spot. If possible can you grab the console output 5 minutes after the machine has failed / been killed (`aws ec2 get-console-output -i $instance_id --latest`)?
Created attachment 1726102 [details] Bug1892521_logs_1030.tgz
please check latest logs in Bug1892521_logs_1030.tgz. I collect system console by using `aws ec2 get-console-output -i $instance_id > $log_file` every 30 seconds after bootstrap instance id is available, and last 2 hours. (means still get console logs 90 mins after job failed.) I removed the duplicate logs file (checking via md5sum). EC2 console does not support `--latest` option: >> You can optionally retrieve the latest serial console output at any time during the instance lifecycle. This option is only supported on Instances built on the Nitro System. It is not supported through the Amazon EC2 console. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html
Looking through the new set of logs nothing is jumping out that this is Ignition related. Ignition does seem to be fetching the config from the metadata service once NetworkManager gets the networking up (Ignition has built-in retries which is the logs in the initial post) as seen in this chunk: [ 10.341853] ignition[713]: GET http://169.254.169.254/2019-10-01/user-data: attempt #3 [ 10.348561] NetworkManager[735]: <info> [1604310826.9742] dhcp4 (ens3): option requested_broadcast_address => '1' [ 10.356673] NetworkManager[735]: <info> [1604310826.9742] dhcp4 (ens3): option requested_domain_name => '1' [ 10.365086] NetworkManager[735]: <info> [1604310826.9742] dhcp4 (ens3): option requested_domain_name_servers => '1' [ 10.371674] ignition[713]: GET result: OK The machine seems to be doing a reboot after the fetch stage but I'm not sure why that is and after it comes back Ignition doesn't seem to be being ran at all before we hit the emergency shell. AWS console logs are generally rather mangled so it's hard to have an accurate timeline from them. If possible can you provide a more compact reproducer so we can do additional debugging?
Creating a disconnected cluster with proxy enabled: 1. create VPC by using 01_vpc_disconnected_aws_with_privatelink.yaml (all CF templates are attached in Bug1892521_logs_1104.tgz) 2. create a bastion host with http proxy in public subnet 3. create install-config.yaml with http proxy 4. create CF stack 02_cluster_infra.yaml 5. create CF stack 03_cluster_security.yaml 6. create CF stack 04_cluster_bootstrap_disconnected.yaml * the whole jenkins output is in the file jenkins_output.txt in Bug1892521_logs_1104.tgz
Created attachment 1726524 [details] Bug1892521_logs_1104.tgz
We'll target this as part of 4.7, but we are currently working on higher priority items related to 4.7 features. Is it possible to reproduce this with just a single RHCOS node + Ignition config? Or do you think it requires all the setup for a disconnected environment?
(In reply to Micah Abbott from comment #10) > We'll target this as part of 4.7, but we are currently working on higher > priority items related to 4.7 features. > > Is it possible to reproduce this with just a single RHCOS node + Ignition > config? Or do you think it requires all the setup for a disconnected > environment? I think disconnected env is required, since the normal UPI works well.
QE did some research on this issue: >> Test 1 Normal UPI, access ignition file via s3 URI (S3://) (tested, works well) >> Test 2 Disconnect UPI, access ignition file via s3 URI (s3://), since s3 endpoint is enabled, it should get ignition file, but it doesn't (this bug) >> Test 3 Disconnect UPI, access ignition file via s3 pre-signed url (https://) (tested, works well) OCP 4.6 could handle s3 URI (s3://) correctly in disconnected environment, so there is no such issue on 4.6 (Test 2 got passed on 4.6), seems some settings or processes were changed in rhcos 4.7 image. A good option is, s3 URI could be handled correctly by rhcos 4.7 in disconnect environment, same as 4.6, otherwise, we need to document this issue, using pre-signed url instead of s3 URI in disconnected environment. Since we get a workaround for this issue by using pre-signed url, remove `TestBlocker` keyword.
Higher priority work related to 4.7 features prevented this from being worked on; setting UpcomingSprint
Setting No Doc Update because this is a 4.7 regression per comment 13.
from OpenShift installer side, verified and pass. OCP version: 4.7.0-0.nightly-2021-01-19-095812 RHCOS version: 47.83.202101161239-0 cluster was installed successfully.
Closing as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1892521#c17
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633