Bug 1967483

Summary: coreos-installer fails to download Ignition (DNS error, failed to lookup address)
Product: OpenShift Container Platform
Reporter: Jonas Nordell <jnordell>
Component: RHCOS
Assignee: Jonathan Lebon <jlebon>
Status: CLOSED ERRATA
QA Contact: HuijingHei <hhei>
Severity: medium
Docs Contact:
Priority: low
Version: 4.7
CC: aivaraslaimikis, andbartl, bgalvani, bgilbert, chdeshpa, dornelas, dustymabe, jlebon, jligon, lucab, miabbott, mnguyen, mrussell, nstielau
Target Milestone: ---
Target Release: 4.9.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: NetworkManager-wait-online.service timed out too early, preventing a connection from being established before coreos-installer started.
Consequence: coreos-installer failed to fetch the Ignition config if the network took too long to come up.
Fix: The NetworkManager-wait-online.service timeout has been increased to its default upstream value.
Result: coreos-installer no longer fails to fetch the Ignition config, because it runs only after networking is up.
Story Points: ---
Clone Of:
: 1983773 1983774 1991712 (view as bug list) Environment:
Last Closed: 2021-10-18 17:32:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1981999    
Bug Blocks: 1983773, 1991712    

Description Jonas Nordell 2021-06-03 08:15:01 UTC
OCP Version at Install Time: 4.7.12
RHCOS Version at Install Time: 4.7.7
Platform: Bare metal
Architecture: x86_64


What are you trying to do? What is your use case?

Installing bare metal nodes with static IPs and Bonding by customizing grub.

What happened? What went wrong or what did you expect?

coreos-installer-service fails because of DNS issues, and possibly also because of an "Error: parsing arguments" error that it reports just before the DNS errors. This suggests that the network might not be set up properly.


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.

Custom Grub configuration:
        linux /images/pxeboot/vmlinuz random.trust_cpu=on coreos.liveiso=rhcos-47.83.202103251640-0 ignition.firstboot ignition.platform.id=metal coreos.inst.install_dev=sda coreos.inst.ignition_url=http://example.com/pub/bootstrap.ign bond=bond0:ens2f0,ens6f0:mode=802.3ad,lacp_rate=fast,miimon=100,xmit_hash_policy=layer2+3 ip=10.1.0.10::10.1.0.1:255.255.255.0:host1.ocp.example.com:bond0:none nameserver=10.1.0.245 nameserver=10.2.0.245
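
For readers checking the static network arguments, the `ip=` value above can be split into its fields. This is a minimal sketch: `parse_ip_karg` is a hypothetical helper, the field order is assumed from dracut.cmdline(7) (client IP, peer, gateway, netmask, hostname, interface, autoconf), and only IPv4 values are handled.

```shell
#!/bin/sh
# Split a dracut-style static ip= value into its colon-separated fields.
# Assumed field order (dracut.cmdline(7)):
#   client-IP:[peer]:gateway:netmask:hostname:interface:autoconf
# IPv4 only: a bracketed IPv6 address contains colons and is not handled.
# The body runs in a subshell so the IFS and globbing changes do not leak.
parse_ip_karg() (
    IFS=:
    set -f          # disable globbing while the value is split
    # Intentionally unquoted: split the argument on ':' into $1..$7.
    set -- $1
    printf 'address=%s\npeer=%s\ngateway=%s\nnetmask=%s\nhostname=%s\ninterface=%s\nautoconf=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7"
)

# The value from the GRUB entry above; the empty second field is the peer.
parse_ip_karg '10.1.0.10::10.1.0.1:255.255.255.0:host1.ocp.example.com:bond0:none'
```

A mismatch between the interface field here (`bond0`) and the name given in `bond=` would be one thing to rule out before looking at timing.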

Note:

After the installer fails, starting coreos-installer-service manually works as expected.

Comment 4 Luca BRUNO 2021-06-03 09:06:22 UTC
Thanks for the report. It sounds like the network is either taking a long time to configure or not getting proper connectivity at all.

Can you please attach the full journal for that boot? To investigate this properly, we need to see what NetworkManager is doing in the background.

Additionally, once you get to the emergency shell, are you able to interactively `curl` the remote Ignition configuration? Does DNS resolution work correctly at that point?
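
The two checks asked for here could be run from the emergency shell roughly as follows. This is only a sketch: `url_host` and `check_ignition_url` are hypothetical helper names, and it assumes `getent` and `curl` are available in that environment.

```shell
#!/bin/sh
# Extract the host part of an http(s) URL.
# Simplification: userinfo and IPv6 literals are not handled.
url_host() {
    host=${1#*://}      # drop the scheme
    host=${host%%/*}    # drop the path
    host=${host%%:*}    # drop an optional port
    printf '%s\n' "$host"
}

# Check DNS resolution first, then try to fetch the config itself.
check_ignition_url() {
    host=$(url_host "$1")
    if ! getent hosts "$host" >/dev/null; then
        echo "DNS lookup failed for $host" >&2
        return 1
    fi
    # -f: treat HTTP errors as failures; -sS: silent except for errors.
    if ! curl -fsS -o /dev/null "$1"; then
        echo "fetch failed for $1" >&2
        return 2
    fi
    echo "ok: $host resolves and $1 is reachable"
}

# Example (run manually): check_ignition_url http://example.com/pub/bootstrap.ign
```

Separating the DNS step from the fetch makes it easier to tell a resolver problem from a connectivity or HTTP problem.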

Comment 12 Jonathan Lebon 2021-06-24 18:59:19 UTC
OK, so we see:

Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2255] dns-mgr: update-resolv-conf: write to /etc/resolv.conf succeeded (rc-manager=symlink)
Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2256] dns-mgr: update-resolv-conf: write internal file /run/NetworkManager/resolv.conf succeeded
Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2256] dns-mgr: current configuration: [{'nameservers': <['10.108.180.245', '10.108.240.245']>, 'interface': <'bond0'>, 'priority': <100>, 'vpn': <false>}]

happening before coreos-installer is started:

Jun 22 06:56:37 dk1osp1001.eva.danskenet.com systemd[1]: Starting CoreOS Installer...
Jun 22 06:56:37 dk1osp1001.eva.danskenet.com coreos-installer-service[1879]: coreos-installer install /dev/sda --ignition-url http://artifactory.danskenet.net/artifactory/db-generic-paas/openshift4/install/az2-osp10/master.ign --insecure-ignition --firstboot-args rd.neednet=1 bond=bond0:ens1f0,ens2f0:mode=802.3ad,lacp_rate=fast,miimon=100,xmit_hash_policy=layer2+3 ip=10.133.223.21::10.133.223.1:255.255.255.224:dk1osp1001.eva.danskenet.com:bond0:none nameserver=10.108.180.245 nameserver=10.108.240.245

And we know the settings are correct, because the customer said they can curl the same URL fine from the emergency shell (right? Or was the successful curl for a different URL?).

Anyway, I'm not sure where the issue is, but if it's a race of some kind, I think it would be solved by https://github.com/coreos/coreos-installer/issues/283.

<time passes>

OK, I've implemented that now in https://github.com/coreos/coreos-installer/pull/565.

I'll try to get a scratch build with that for testing.
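
If the failure really is a race between network setup and the first fetch, one generic mitigation is to retry the fetch after a short delay. The sketch below is illustrative only: `retry` is a hypothetical helper, and it is not taken from the linked issue or pull request.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 otherwise.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (run manually): up to 5 tries, 2 seconds apart.
# retry 5 2 curl -fsS -o /dev/null http://example.com/pub/bootstrap.ign
```

A bounded retry tolerates slow DNS or link negotiation while still failing on a genuinely wrong configuration.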

Comment 13 Jonathan Lebon 2021-06-24 19:36:22 UTC
Scratch build available at: https://s3.amazonaws.com/rhcos-jlebon/coreos-installer-pr565/builds/48.84.202106241856-0/x86_64/meta.json

Can the customer try the live ISO from there (i.e. https://s3.amazonaws.com/rhcos-jlebon/coreos-installer-pr565/builds/48.84.202106241856-0/x86_64/rhcos-48.84.202106241856-0-live.x86_64.iso) and see if it works? By default it'll install RHCOS 4.8, which is still under development, but the machine should be able to pivot back to 4.7 fine. Otherwise, they can also download the metal image there and point at it with `coreos.inst.image_url`. (Usual disclaimer here that this scratch build isn't officially supported and only intended to help in developing a solution.)

Comment 14 Micah Abbott 2021-06-28 15:24:50 UTC
This won't make the OCP 4.8 GA cutoff, so targeting for 4.9.

If we can confirm that the scratch build fixes the issue, we can consider backporting to 4.8 and/or 4.7.

Comment 25 Jonathan Lebon 2021-07-07 15:31:52 UTC
Fix in https://github.com/coreos/fedora-coreos-config/pull/1088.

Comment 29 Jonathan Lebon 2021-07-20 15:52:31 UTC
Latest RHCOS 4.9 build has the fix for this, so should be ready to be verified.

Comment 30 Jonathan Lebon 2021-07-20 18:18:29 UTC
Sorry for the confusion on this. It has to stay in POST until the 4.9 bootimage bump PR gets merged.

Comment 33 Jonathan Lebon 2021-08-16 15:20:26 UTC
The 4.7 backport for this is https://bugzilla.redhat.com/show_bug.cgi?id=1983774.

Comment 34 Micah Abbott 2021-08-27 13:49:43 UTC
The boot image bump is merged; moving to MODIFIED.

Comment 37 RHCOS Bug Bot 2021-09-02 16:36:31 UTC
The fix for this bug will not be delivered to customers until it lands in an updated bootimage.  That process is tracked in bug 1981999, which is in state ASSIGNED.  Moving this bug back to POST.

Comment 38 RHCOS Bug Bot 2021-09-02 17:37:22 UTC
This bug has been reported fixed in a new RHCOS build.  Do not move this bug to MODIFIED until the fix has landed in a new bootimage.

Comment 39 RHCOS Bug Bot 2021-09-22 18:37:26 UTC
The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED).  Moving this bug to MODIFIED.

Comment 42 Michael Nguyen 2021-10-01 14:22:23 UTC
Verified on RHCOS 49.84.202109302214-0. The timeout for nm-wait-online is no longer overridden:

[core@localhost 35coreos-live]$ ls
coreos-live-clear-sssd-cache.service   coreos-livepxe-rootfs.sh
coreos-live-unmount-tmpfs-var.service  is-live-image.sh
coreos-live-unmount-tmpfs-var.sh       live-generator
coreos-liveiso-persist-osmet.service   module-setup.sh
coreos-livepxe-persist-osmet.service   ostree-cmdline.sh
coreos-livepxe-rootfs.service
[core@localhost 35coreos-live]$ pwd
/usr/lib/dracut/modules.d/35coreos-live
[core@localhost 35coreos-live]$ cat live-generator | grep nm-wait-online
[core@localhost 35coreos-live]$ cat module-setup.sh | grep nm-wait-online
[core@localhost 35coreos-live]$ rpm-ostree status
State: idle
Deployments:
* ostree://802484c7158d05c1f34466b433ec4e680afe8a4b34e37d9529b2bf5c00a5a88d
                   Version: 49.84.202109302214-0 (2021-09-30T22:17:42Z)
[core@localhost 35coreos-live]$

Comment 44 errata-xmlrpc 2021-10-18 17:32:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759