Bug 1967483
| Summary: | coreos-installer fails to download Ignition (DNS error, failed to lookup address) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jonas Nordell <jnordell> |
| Component: | RHCOS | Assignee: | Jonathan Lebon <jlebon> |
| Status: | CLOSED ERRATA | QA Contact: | HuijingHei <hhei> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.7 | CC: | aivaraslaimikis, andbartl, bgalvani, bgilbert, chdeshpa, dornelas, dustymabe, jlebon, jligon, lucab, miabbott, mnguyen, mrussell, nstielau |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Clones: | 1983773 1983774 1991712 |
| Environment: | | | |
| Last Closed: | 2021-10-18 17:32:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1981999 | | |
| Bug Blocks: | 1983773, 1991712 | | |

Doc Text:

> Cause: NetworkManager-wait-online.service timed out too early, preventing a connection from being established before coreos-installer started.
> Consequence: coreos-installer failed to fetch the Ignition config if the network took too long to come up.
> Fix: The NetworkManager-wait-online.service timeout has been increased to its default upstream value.
> Result: coreos-installer no longer fails to fetch the Ignition config, since it only runs after networking is up.
Description
Jonas Nordell
2021-06-03 08:15:01 UTC
Thanks for the report. It sounds like the network is either taking a long time to configure or not getting proper connectivity at all. Can you please attach the full journal for that boot? To investigate this properly, we'll have to look at what NetworkManager is doing in the background. Additionally, once you get to the emergency shell, are you able to interactively `curl` the remote Ignition configuration? Does DNS resolution work correctly at that point?

OK, so we see:

```
Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2255] dns-mgr: update-resolv-conf: write to /etc/resolv.conf succeeded (rc-manager=symlink)
Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2256] dns-mgr: update-resolv-conf: write internal file /run/NetworkManager/resolv.conf succeeded
Jun 22 06:56:34 dk1osp1001.eva.danskenet.com NetworkManager[1709]: <trace> [1624344994.2256] dns-mgr: current configuration: [{'nameservers': <['10.108.180.245', '10.108.240.245']>, 'interface': <'bond0'>, 'priority': <100>, 'vpn': <false>}]
```

happening before coreos-installer is started:

```
Jun 22 06:56:37 dk1osp1001.eva.danskenet.com systemd[1]: Starting CoreOS Installer...
Jun 22 06:56:37 dk1osp1001.eva.danskenet.com coreos-installer-service[1879]: coreos-installer install /dev/sda --ignition-url http://artifactory.danskenet.net/artifactory/db-generic-paas/openshift4/install/az2-osp10/master.ign --insecure-ignition --firstboot-args rd.neednet=1 bond=bond0:ens1f0,ens2f0:mode=802.3ad,lacp_rate=fast,miimon=100,xmit_hash_policy=layer2+3 ip=10.133.223.21::10.133.223.1:255.255.255.224:dk1osp1001.eva.danskenet.com:bond0:none nameserver=10.108.180.245 nameserver=10.108.240.245
```

And we know the settings are correct, because the customer said they can `curl` the same URL fine from the emergency shell (right? Or was the successful curl for a different URL?).
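For anyone hitting a similar failure, the diagnosis above can be reproduced interactively. The following is an illustrative sketch, not part of the original report; the hostname and URL are the ones from this bug, so substitute your own:

```shell
# Run from the dracut emergency shell after the installer fails.

# 1. Which nameservers did NetworkManager write out?
cat /etc/resolv.conf

# 2. Does DNS resolution work on its own? (The reported error was a
#    failed address lookup.)
getent hosts artifactory.danskenet.net || echo "DNS lookup failed"

# 3. Can we fetch the Ignition config directly? (-I: headers only,
#    -L: follow redirects, bounded by a 10-second timeout)
curl -sSIL --max-time 10 \
  http://artifactory.danskenet.net/artifactory/db-generic-paas/openshift4/install/az2-osp10/master.ign \
  || echo "Ignition fetch failed"
```

If these succeed interactively even though the installer failed moments earlier, that points at a race between network bring-up and installer start, which is exactly what this bug turned out to be.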
Anyway, I'm not sure where the issue is, but if it's a race of some kind, I think it would be solved by https://github.com/coreos/coreos-installer/issues/283.

<time passes> OK, I've implemented that now in https://github.com/coreos/coreos-installer/pull/565. I'll try to get a scratch build with that for testing.

Scratch build available at: https://s3.amazonaws.com/rhcos-jlebon/coreos-installer-pr565/builds/48.84.202106241856-0/x86_64/meta.json

Can the customer try the live ISO from there (i.e. https://s3.amazonaws.com/rhcos-jlebon/coreos-installer-pr565/builds/48.84.202106241856-0/x86_64/rhcos-48.84.202106241856-0-live.x86_64.iso) and see if it works? By default it'll install RHCOS 4.8, which is still under development, but the machine should be able to pivot back to 4.7 fine. Otherwise, they can also download the metal image there and point at it with `coreos.inst.image_url`. (Usual disclaimer that this scratch build isn't officially supported and is only intended to help in developing a solution.)

This won't make the OCP 4.8 GA cutoff, so targeting 4.9. If we can confirm the scratch build fixes the issue, we can consider backporting to 4.8 and/or 4.7.

The latest RHCOS 4.9 build has the fix for this, so it should be ready to be verified.

Sorry for the confusion on this. It has to stay in POST until the 4.9 bootimage bump PR gets merged. The 4.7 backport for this is https://bugzilla.redhat.com/show_bug.cgi?id=1983774.

Boot image bump is merged, moving to MODIFIED.

The fix for this bug will not be delivered to customers until it lands in an updated bootimage. That process is tracked in bug 1981999, which is in state ASSIGNED. Moving this bug back to POST. This bug has been reported fixed in a new RHCOS build. Do not move this bug to MODIFIED until the fix has landed in a new bootimage.

The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED). Moving this bug to MODIFIED.

Verified on RHCOS 49.84.202109302214-0.
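The shipped fix restores the upstream default timeout of NetworkManager-wait-online.service instead of the shortened override the live initramfs previously carried. As a local workaround on an affected build, a systemd drop-in along these lines could have a similar effect. This is an illustrative sketch, not the actual patch: the file path is hypothetical, and the 60-second value is an assumption (check the unit your build ships with `systemctl cat NetworkManager-wait-online.service`):

```ini
# Hypothetical drop-in:
# /etc/systemd/system/NetworkManager-wait-online.service.d/timeout.conf
#
# Clear the shortened ExecStart and re-run nm-online with a longer
# timeout, so slow networks (e.g. an LACP bond still negotiating) can
# come up before units ordered after network-online.target start.
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=60
```

Note that `nm-online -s` waits for NetworkManager startup to complete, and the empty `ExecStart=` line is required to replace (rather than append to) the original command in a drop-in.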
No more overriding the timeout for nm-wait-online:

```
[core@localhost 35coreos-live]$ ls
coreos-live-clear-sssd-cache.service    coreos-livepxe-rootfs.sh
coreos-live-unmount-tmpfs-var.service   is-live-image.sh
coreos-live-unmount-tmpfs-var.sh        live-generator
coreos-liveiso-persist-osmet.service    module-setup.sh
coreos-livepxe-persist-osmet.service    ostree-cmdline.sh
coreos-livepxe-rootfs.service
[core@localhost 35coreos-live]$ pwd
/usr/lib/dracut/modules.d/35coreos-live
[core@localhost 35coreos-live]$ cat live-generator | grep nm-wait-online
[core@localhost 35coreos-live]$ cat module-setup.sh | grep nm-wait-online
[core@localhost 35coreos-live]$ rpm-ostree status
State: idle
Deployments:
* ostree://802484c7158d05c1f34466b433ec4e680afe8a4b34e37d9529b2bf5c00a5a88d
    Version: 49.84.202109302214-0 (2021-09-30T22:17:42Z)
[core@localhost 35coreos-live]$
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759