Bug 2029438

Summary:

Bootstrap node cannot resolve api-int because NetworkManager replaces resolv.conf

Product:

OpenShift Container Platform

Reporter:

Jim Ramsay <jramsay>

Component:

Installer

Assignee:

Ben Nemec <bnemec>

Installer sub component:

openshift-installer

QA Contact:

jima

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

medium

CC:

bgalvani, bnemec, jima, otuchfel, padillon, sasha, vpickard, wsun, yboaron

Version:

4.9

Target Milestone:

---

Target Release:

4.11.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Telco; Telco:RAN

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Cause: The vSphere RHCOS image has no /etc/resolv.conf.
Consequence: The default NetworkManager settings cause attempts to access /etc/resolv.conf, which throw an error when the file is not found.
Fix: Set rc-manager=unmanaged.
Result: NetworkManager does not attempt to access /etc/resolv.conf.

Story Points:

---

Clone Of:

Clones:

2083335

Environment:

Last Closed:

2022-08-10 10:40:31 UTC

Type:

Bug

Bug Blocks:

2083335

Attachments:

Description	Flags
bootkube.service.log	none

Description Jim Ramsay 2021-12-06 14:12:03 UTC

Version:

Deploying OpenShift 4.9.7

Platform:

baremetal

Using the Assisted Installer (AI) on ACM

What happened?

Trying to deploy a 3-node compressed cluster all on baremetal.  The lab environment's main DNS does not have a record for the 'api-int.$cluster' address.

The first two master nodes installed properly, but the bootstrap node stayed stuck in the bootstrap role indefinitely.

According to the bootkube.service logs, it repeatedly tried and failed to resolve api-int.$cluster:
> Dec 03 20:34:07 cnfdf02.telco5gran.eng.rdu2.redhat.com bootkube.sh[26004]: Unable to connect to the server: dial tcp: lookup api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com on 10.11.5.19:53: no such host

That is accurate: the upstream DNS (10.11.5.19) indeed has no record for api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. However, the internal DNS does:

$ dig @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61870
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d1e499aa081db818 (echoed)
;; QUESTION SECTION:
;api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com.        IN A

;; ANSWER SECTION:
api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. 30 IN A 10.8.34.52


So the bootstrap node should be able to resolve this address!  However, /etc/resolv.conf says:
> # Generated by NetworkManager
> search telco5gran.eng.rdu2.redhat.com
> nameserver 10.11.5.19

After some conversation in Slack, it looks like there may be a race condition between three steps: NetworkManager bringing up the interface, the nm-dispatcher adding localhost to /etc/resolv.conf, and NetworkManager doing further processing that resets resolv.conf to only what is in the nmconnection file: https://coreos.slack.com/archives/CUPJTHQ5P/p1638577329289700?thread_ts=1638483562.237600&cid=CUPJTHQ5P
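
To make the failure condition concrete, here is a minimal Python sketch of a check for whether a resolv.conf still lists a local resolver. It is not part of the installer or of NetworkManager; the function names, and the choice of 127.0.0.1 and ::1 as the "local" addresses, are assumptions made for illustration only.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only "nameserver <address>" lines matter; comments and "search" are skipped.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_local_resolver(resolv_conf_text, local_addresses=("127.0.0.1", "::1")):
    """True if any listed nameserver is one of the assumed local addresses."""
    return any(s in local_addresses for s in nameservers(resolv_conf_text))


# The resolv.conf content quoted above from the stuck bootstrap node.
STUCK = """# Generated by NetworkManager
search telco5gran.eng.rdu2.redhat.com
nameserver 10.11.5.19
"""

print(has_local_resolver(STUCK))  # prints False
```

A check like this could be run in a loop on the bootstrap node to catch the moment the local entry disappears, which may help confirm the suspected race.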

What did you expect to happen?

The bootstrap node should always have its own address in /etc/resolv.conf so it can always resolve api-int.$cluster and complete the install successfully.

How to reproduce it (as minimally and precisely as possible)?

Deploy a cluster with a static IPv4 configuration in an environment where there's no DNS record for 'api-int.$cluster'.

Example nmconnection:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

(Note: This was generated with `nmstate gc <config>` but nmstate is not running on the node)

Anything else we should know:

There are 2 workarounds:
- On a system where the bootstrap node is in this stuck state, running something trivial like "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1" will cause the localhost entry to be re-added to resolv.conf and the install proceeds
- On a fresh install, adding "127.0.0.1" to the static DNS configuration also allows the install to proceed.
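
As a rough illustration of the second workaround, the following Python sketch adds 127.0.0.1 to the dns key of the [ipv4] section in nmconnection text. It treats the keyfile as INI via configparser, which is only an approximation of the real keyfile format. The function name add_local_dns is hypothetical, and placing the local address first is an assumption; the workaround above does not specify an order.

```python
import configparser
import io


def add_local_dns(keyfile_text, local="127.0.0.1"):
    """Return keyfile text with `local` present in the [ipv4] dns list."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(keyfile_text)
    # The dns value is a ';'-separated list with a trailing ';'.
    servers = [s for s in parser.get("ipv4", "dns", fallback="").split(";") if s]
    if local not in servers:
        servers.insert(0, local)  # assumption: local resolver goes first
    parser.set("ipv4", "dns", ";".join(servers) + ";")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# A trimmed [ipv4] section taken from the example nmconnection above.
SAMPLE = "[ipv4]\naddress1=10.8.34.12/24\ndns=10.11.5.19;\nmethod=manual\n"

print(add_local_dns(SAMPLE))
```

Applying the function twice leaves the dns list unchanged, so it is safe to rerun.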

Comment 1 Jim Ramsay 2021-12-06 14:34:46 UTC

Created attachment 1844924
bootkube.service.log

Comment 2 Omer Tuchfeld 2021-12-06 14:54:38 UTC

Can you please make the following modifications to the bug description:

- Remove references to "nmstate". nmstate is not being used here; these are just raw nmconnection files generated by the assisted service (using nmstate, but that's beside the point). In your case the generated file is the following, and this is what matters:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

- Remove the .interfaces stanza from the yaml under "and with nmstate something like the following:", it's assisted-installer specific and is not relevant to the problem. Only the content under ".config" is the actual nmstate config. And even then, please just specify that the nmconnection file above is simply generated with `nmstate gc <config>` and nmstate is not running on the node

- Replace the workaround "I have a workaround: If I manually add "127.0.0.1" to the dns-resolver section of my nmstate, the install succeeds." with this workaround "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1" - it shows that simply doing a meaningless action on interfaces will trigger the dispatcher script which works as intended.

Comment 10 Matthew Staebler 2021-12-17 14:10:57 UTC

*** Bug 2033550 has been marked as a duplicate of this bug. ***

Comment 11 Matthew Staebler 2021-12-17 14:12:38 UTC

This issue is not unique to baremetal. See https://bugzilla.redhat.com/show_bug.cgi?id=2033550 where the same issue is happening with vSphere.

Comment 12 Michael Filanov 2021-12-29 13:34:41 UTC

*** Bug 2027836 has been marked as a duplicate of this bug. ***

Comment 13 jima 2022-02-09 03:18:28 UTC

The issue has recently occurred several times against 4.10, both in QE CI and in manual installation.
Is there any plan to fix the issue in 4.10?

Comment 14 Wei Sun 2022-02-10 06:04:31 UTC

Once this happens, the cluster cannot be set up successfully. Per comment 13, updating the severity to high.

Comment 15 Patrick Dillon 2022-03-22 17:47:18 UTC

We are researching who the correct assignee for this bz is.

Comment 21 jima 2022-04-25 06:20:53 UTC

UPI-on-vSphere installation fails at the bootstrap stage when using nightly build 4.11.0-0.nightly-2022-04-24-085400 (which contains the fix) or a later payload; it succeeds against 4.11.0-0.nightly-2022-04-23-153426.

On the bootstrap instance, /etc/resolv.conf was not generated:
[root@bootstrap-0 ~]# ls -ltr /etc/resolv.conf
ls: cannot access '/etc/resolv.conf': No such file or directory

rc-manager is configured as unmanaged:
[root@bootstrap-0 ~]# ls -ltr /etc/NetworkManager/conf.d/99-vsphere.conf 
-rw-------. 1 root root 28 Apr 25 03:04 /etc/NetworkManager/conf.d/99-vsphere.conf
[root@bootstrap-0 ~]# cat /etc/NetworkManager/conf.d/99-vsphere.conf
[main]
rc-manager=unmanaged
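
The manual check above can also be sketched in Python. This is a minimal illustration, not an installer tool: it parses the drop-in with configparser, which only approximates NetworkManager's keyfile-style configuration format, and the function name rc_manager is hypothetical.

```python
import configparser


def rc_manager(conf_text):
    """Return the rc-manager value from the [main] section, or None if unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get("main", "rc-manager", fallback=None)


# Same content as the 99-vsphere.conf drop-in shown above.
SAMPLE = "[main]\nrc-manager=unmanaged\n"

print(rc_manager(SAMPLE))  # prints unmanaged
```

On a node, the text would come from reading /etc/NetworkManager/conf.d/99-vsphere.conf; a result of None means the drop-in does not set rc-manager.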

Comment 23 Ben Nemec 2022-05-09 17:59:34 UTC

The UPI bug was fixed by https://github.com/openshift/installer/pull/5842 . This should be ready for testing again.

Comment 25 jima 2022-05-17 08:37:24 UTC

The vSphere UPI installation issue from comment 21 has been fixed by https://github.com/openshift/installer/pull/5842 and verification passed: UPI installation succeeds without any error.

The original issue described in this bug also happened occasionally on IPI-on-vSphere in QE CI (1-2 times per week). After PR installer#5482 was merged, I monitored QE CI for two weeks and did not hit the issue again in CI or in manual installation. The issue should be fixed; moving the bug to VERIFIED.

Comment 28 errata-xmlrpc 2022-08-10 10:40:31 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069