Bug 2029438 - Bootstrap node cannot resolve api-int because NetworkManager replaces resolv.conf
Summary: Bootstrap node cannot resolve api-int because NetworkManager replaces resolv.conf
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Ben Nemec
QA Contact: jima
URL:
Whiteboard: Telco; Telco:RAN
Duplicates: 2027836 2033550 (view as bug list)
Depends On:
Blocks: 2083335
 
Reported: 2021-12-06 14:12 UTC by Jim Ramsay
Modified: 2022-10-05 04:01 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The vSphere RHCOS image has no /etc/resolv.conf.
Consequence: Default NetworkManager settings cause attempts to access /etc/resolv.conf, which fail with an error when the file is not found.
Fix: Set rc-manager=unmanaged.
Result: NetworkManager does not attempt to access /etc/resolv.conf.
Clone Of:
Clones: 2083335 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:40:31 UTC
Target Upstream Version:
Embargoed:


Attachments
bootkube.service.log (443.07 KB, text/plain)
2021-12-06 14:34 UTC, Jim Ramsay


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 5482 0 None open Bug 2029438: Set rc-manager=unmanaged on baremetal bootstrap 2021-12-13 21:51:29 UTC
Github openshift installer pull 5842 0 None Merged vsphere upi: missing etc resolv 2022-05-13 14:40:49 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:40:47 UTC

Description Jim Ramsay 2021-12-06 14:12:03 UTC
Version:

Deploying OpenShift 4.9.7

Platform:

baremetal

Using the Assisted Installer (AI) on ACM

What happened?

Trying to deploy a 3-node compact cluster, all on baremetal.  The lab environment's main DNS has no record for the 'api-int.$cluster' address.

The first two master nodes installed properly, but the bootstrap node remained stuck in the bootstrap stage indefinitely.

According to the bootkube.service logs, it was trying but failing to resolve api-int.$cluster repeatedly:
> Dec 03 20:34:07 cnfdf02.telco5gran.eng.rdu2.redhat.com bootkube.sh[26004]: Unable to connect to the server: dial tcp: lookup api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com on 10.11.5.19:53: no such host

It's correct in that the upstream DNS (10.11.5.19) does indeed have no record for api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com; however, the internal DNS does:

$ dig @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61870
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d1e499aa081db818 (echoed)
;; QUESTION SECTION:
;api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com.        IN A

;; ANSWER SECTION:
api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. 30 IN A 10.8.34.52


So the bootstrap node should be able to resolve this address!  However, /etc/resolv.conf says:
> # Generated by NetworkManager
> search telco5gran.eng.rdu2.redhat.com
> nameserver 10.11.5.19

After some conversation in Slack, it looks like there may be a race condition between NetworkManager bringing up the interface, the nm-dispatcher script adding localhost to /etc/resolv.conf, and NetworkManager doing further processing that resets resolv.conf to only what is in the nmconnection file: https://coreos.slack.com/archives/CUPJTHQ5P/p1638577329289700?thread_ts=1638483562.237600&cid=CUPJTHQ5P
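For illustration only, the dispatcher step described above amounts to something like the following minimal sketch (hypothetical; the actual script shipped on the bootstrap node is not shown in this bug):

#!/bin/bash
# Hypothetical sketch: ensure the node's own resolver comes first in
# /etc/resolv.conf so api-int.$cluster resolves via the local DNS server.
grep -q 'nameserver 127.0.0.1' /etc/resolv.conf || \
    sed -i '1i nameserver 127.0.0.1' /etc/resolv.conf

The race is that NetworkManager can rewrite /etc/resolv.conf from the nmconnection profile after something like this runs, discarding the localhost entry again.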

What did you expect to happen?

The bootstrap node should always have its own address in /etc/resolv.conf so it can always resolve api-int.$cluster and complete the install successfully.
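As a hedged illustration, the expected state would look like the actual file shown above plus a local entry (whether that entry is 127.0.0.1 or the node's interface IP is an implementation detail):

# Generated by NetworkManager
search telco5gran.eng.rdu2.redhat.com
nameserver 127.0.0.1
nameserver 10.11.5.19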

How to reproduce it (as minimally and precisely as possible)?

Deploy a cluster with a static IPv4 configuration in an environment where there's no DNS record for 'api-int.$cluster'.

Example nmconnection:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

(Note: This was generated with `nmstate gc <config>` but nmstate is not running on the node)

Anything else we should know:

There are 2 workarounds:
- On a system where the bootstrap node is in this stuck state, running something trivial like "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1" causes the localhost entry to be re-added to resolv.conf, and the install proceeds.
- On a fresh install, adding "127.0.0.1" to the static DNS configuration also lets the install proceed (see the illustrative edit below).
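For the second workaround, the change amounts to prepending 127.0.0.1 to the dns= key in the [ipv4] section of the nmconnection above (an illustrative edit, not taken from an actual deployment):

[ipv4]
# other keys unchanged from the example above
dns=127.0.0.1;10.11.5.19;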

Comment 1 Jim Ramsay 2021-12-06 14:34:46 UTC
Created attachment 1844924 [details]
bootkube.service.log

Comment 2 Omer Tuchfeld 2021-12-06 14:54:38 UTC
Can you please make the following modifications to the bug description:

- Remove references to "nmstate" - nmstate is not being used here; these are just raw nmconnection files generated by the assisted service (which uses nmstate internally, but that's beside the point). In your case the following is what gets generated, and this is what matters:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

- Remove the .interfaces stanza from the YAML under "and with nmstate something like the following:"; it's assisted-installer specific and not relevant to the problem. Only the content under ".config" is the actual nmstate config. Even then, please just note that the nmconnection file above was generated with `nmstate gc <config>` and that nmstate is not running on the node.

- Replace the workaround "I have a workaround: If I manually add "127.0.0.1" to the dns-resolver section of my nmstate, the install succeeds." with the workaround "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1" - it shows that simply performing an otherwise meaningless action on the interface triggers the dispatcher script, which works as intended.

Comment 10 Matthew Staebler 2021-12-17 14:10:57 UTC
*** Bug 2033550 has been marked as a duplicate of this bug. ***

Comment 11 Matthew Staebler 2021-12-17 14:12:38 UTC
This issue is not unique to baremetal. See https://bugzilla.redhat.com/show_bug.cgi?id=2033550 where the same issue is happening with vSphere.

Comment 12 Michael Filanov 2021-12-29 13:34:41 UTC
*** Bug 2027836 has been marked as a duplicate of this bug. ***

Comment 13 jima 2022-02-09 03:18:28 UTC
The issue has happened several times against 4.10 recently, in QE CI and in manual installations.
Is there any plan to fix the issue in 4.10?

Comment 14 Wei Sun 2022-02-10 06:04:31 UTC
Once this happens, the cluster cannot be set up successfully. Per comment 13, updating the severity to high.

Comment 15 Patrick Dillon 2022-03-22 17:47:18 UTC
We are researching who the correct assignee for this bz is.

Comment 21 jima 2022-04-25 06:20:53 UTC
UPI-on-vSphere installation failed at the bootstrap stage when using nightly build 4.11.0-0.nightly-2022-04-24-085400 (containing the fix) or a later payload; it succeeded against 4.11.0-0.nightly-2022-04-23-153426.

Checked on the bootstrap instance: /etc/resolv.conf was not generated.
[root@bootstrap-0 ~]# ls -ltr /etc/resolv.conf
ls: cannot access '/etc/resolv.conf': No such file or directory

And rc-manager is configured as unmanaged:
[root@bootstrap-0 ~]# ls -ltr /etc/NetworkManager/conf.d/99-vsphere.conf 
-rw-------. 1 root root 28 Apr 25 03:04 /etc/NetworkManager/conf.d/99-vsphere.conf
[root@bootstrap-0 ~]# cat /etc/NetworkManager/conf.d/99-vsphere.conf
[main]
rc-manager=unmanaged
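For context (general NetworkManager behavior, not specific to this bug): with rc-manager=unmanaged, NetworkManager stops writing /etc/resolv.conf entirely, though it still maintains its own view of DNS, which can be inspected with something like:

[root@bootstrap-0 ~]# cat /run/NetworkManager/resolv.conf

So with this setting, /etc/resolv.conf stays absent unless something else (e.g. ignition or a dispatcher script) creates it, which appears to be what the UPI fix in comment 23 addresses.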

Comment 23 Ben Nemec 2022-05-09 17:59:34 UTC
The UPI bug was fixed by https://github.com/openshift/installer/pull/5842 . This should be ready for testing again.

Comment 25 jima 2022-05-17 08:37:24 UTC
The vSphere UPI installation issue from comment 21 has been fixed by https://github.com/openshift/installer/pull/5842 and verified as passed: UPI installation succeeds without any error.

The original issue described in this bug on IPI-on-vSphere also happened occasionally on QE CI (1-2 times per week). After PR installer#5482 merged, I monitored QE CI for two weeks and did not hit the issue again in CI or in manual installations. The issue should be fixed; moving the bug to VERIFIED.

Comment 28 errata-xmlrpc 2022-08-10 10:40:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

