Bug 1888962 - Name resolution not working due to 99-origin-dns.sh not being executed reliably after upgrading to RHEL 7.9
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: weiwei jiang
Reported: 2020-10-16 15:40 UTC by Simon Krenger
Modified: 2024-03-25 16:45 UTC (History)
Last Closed: 2020-11-18 14:09:55 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 12263 | 0 | None | closed | Bug 1888962: roles/openshift_node: Update NetworkManager conf.d default | 2021-02-16 03:32:53 UTC
Red Hat Knowledge Base (Solution) | 5495021 | 0 | None | None | None | 2020-10-19 14:29:19 UTC
Red Hat Product Errata | RHBA-2020:5107 | 0 | None | None | None | 2020-11-18 14:10:44 UTC

Comment 5 Thomas Haller 2020-10-19 08:13:01 UTC
`systemctl reload NetworkManager` is like sending a SIGHUP signal. It reloads the configuration from disk and also triggers a new DNS update.

It does not, however, run dispatcher scripts. Dispatcher scripts run at various stages of activation, and SIGHUP does not change that.


NetworkManager writes out DNS configuration (for example, to /etc/resolv.conf) at unpredictable moments, whenever it thinks it is necessary. For example, when a new DHCP lease is received.

You cannot configure NetworkManager to write /etc/resolv.conf while also writing it with a dispatcher script. That does not work. If you want to manage /etc/resolv.conf yourself (e.g. with a dispatcher script), then tell NetworkManager not to write /etc/resolv.conf as well. See the `dns=` and `rc-manager=` settings in `man NetworkManager.conf`.
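
As a rough illustration of the `dns=` setting, the following sketch writes a conf.d drop-in that tells NetworkManager to leave /etc/resolv.conf alone. The function name and the file name `90-dns-none.conf` are hypothetical, chosen only for this example. The directory defaults to /etc/NetworkManager/conf.d but can be overridden for a dry run.

```shell
#!/bin/sh
# Sketch only: write a NetworkManager drop-in containing dns=none.
# The function name and the file name 90-dns-none.conf are hypothetical.
write_dns_none_dropin() {
    # First argument overrides the target directory (useful for a dry run).
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir" || return 1
    # dns=none in [main] stops NetworkManager from writing /etc/resolv.conf.
    printf '[main]\ndns=none\n' > "$conf_dir/90-dns-none.conf"
}
```

On a real node this would be run as root, followed by `systemctl reload NetworkManager` so that the configuration is re-read from disk.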



> Network name resolution for internal addresses fails on the Node after the upgrade works as expected.

It's not clear how that could have worked reliably before the update. The dispatcher script and NetworkManager seem to fight over managing /etc/resolv.conf, and that does not work. Maybe it worked before because of a lucky race, or because the configuration was significantly different. Also, what versions of the software were used before the update? And what does the configuration look like?


> Above, we can see that "systemctl restart NetworkManager" is fixing

In most cases, `systemctl restart NetworkManager` is not the right solution for fixing anything. Nor is it clear that it would solve the race.



The solution is not to have two components fight over /etc/resolv.conf.
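
One quick way to see which component wrote the file last is to look for the `# Generated by NetworkManager` comment that NetworkManager puts at the top of a resolv.conf it writes. This is a minimal sketch; the function name is made up for the example. It only tells you whether that header is present, not which other tool wrote the file.

```shell
#!/bin/sh
# Sketch only: report whether a resolv.conf-style file carries the
# header comment that NetworkManager writes. Function name is hypothetical.
resolv_conf_writer() {
    file="${1:-/etc/resolv.conf}"
    if grep -q '^# Generated by NetworkManager' "$file"; then
        echo "NetworkManager"
    else
        # No header found: something else (e.g. a dispatcher script) wrote it.
        echo "other"
    fi
}
```

Calling `resolv_conf_writer` a few times around a DHCP renewal or an interface restart can show whether ownership of the file flips back and forth.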

Comment 7 Thomas Haller 2020-10-19 10:52:39 UTC
> We have observed the change in behaviour when going from NetworkManager-1.18.4-3.el7.x86_64 to NetworkManager-1.18.8-1.el7.x86_64.

Please attach two complete syslog outputs, one showing the working case (rhel-7.8) and one showing the non-working case (rhel-7.9). Also, make sure debug logging is enabled in NetworkManager (level=TRACE), but don't filter the logs down to only NetworkManager messages. See https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28 for hints about logging.
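
For reference, TRACE logging can be enabled with a small `[logging]` drop-in. The sketch below assumes the standard conf.d directory; the function name and the file name `95-trace-logging.conf` are hypothetical.

```shell
#!/bin/sh
# Sketch only: write a NetworkManager drop-in that enables TRACE logging
# for all domains. Function and file names are hypothetical.
write_trace_logging_dropin() {
    # First argument overrides the target directory (useful for a dry run).
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir" || return 1
    printf '[logging]\nlevel=TRACE\ndomains=ALL\n' > "$conf_dir/95-trace-logging.conf"
}
```

After reproducing the problem, something like `journalctl -b > full-syslog.txt` (without a `-u NetworkManager` filter) should capture the complete log for the current boot.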

Comment 12 Simon Krenger 2020-10-19 13:25:50 UTC
Current findings when individually updating packages indicate that NetworkManager is only partially involved, as only updating NetworkManager (NetworkManager-1.18.4-3.el7.x86_64 to NetworkManager-1.18.8-1.el7.x86_64) does NOT reproduce the issue.

The `cloud-init` package (cloud-init-18.5-6.el7_8.5.x86_64 -> cloud-init-19.4-7.el7.x86_64) seems to be the root cause for this issue.
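
To narrow down which packages changed between the working and non-working state, two saved `rpm -qa` listings can be compared. This is only a sketch of that comparison; the function name and the idea of saving the listings to files beforehand are assumptions, not part of the original analysis.

```shell
#!/bin/sh
# Sketch only: print package entries that appear in the "after" listing
# but not in the "before" listing. Function name is hypothetical.
changed_packages() {
    before_sorted=$(mktemp) || return 1
    after_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted with the same collation.
    LC_ALL=C sort "$1" > "$before_sorted"
    LC_ALL=C sort "$2" > "$after_sorted"
    # -13 suppresses lines unique to the first file and lines common to both,
    # leaving only lines unique to the second ("after") file.
    LC_ALL=C comm -13 "$before_sorted" "$after_sorted"
    rm -f "$before_sorted" "$after_sorted"
}
```

For example, `rpm -qa > before.txt` on the RHEL 7.8 node, `rpm -qa > after.txt` after the upgrade, then `changed_packages before.txt after.txt`.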

Comment 13 Thomas Haller 2020-10-19 13:43:17 UTC
(In reply to Simon Krenger from comment #12)
> The `cloud-init` package (cloud-init-18.5-6.el7_8.5.x86_64 ->
> cloud-init-19.4-7.el7.x86_64) seems to be the root cause for this issue.

Probably due to bug 1748015.

Comment 18 weiwei jiang 2020-11-16 06:54:54 UTC
[root@wj311osp1116bmaster-etcd-nfs-1 ~]# oc get nodes -o wide 
NAME                                  STATUS    ROLES     AGE       VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION           CONTAINER-RUNTIME
wj311osp1116bmaster-etcd-nfs-1        Ready     master    15m       v1.11.0+d4cacc0   10.0.150.215   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   docker://1.13.1
wj311osp1116bnode-1                   Ready     compute   11m       v1.11.0+d4cacc0   10.0.151.160   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   docker://1.13.1
wj311osp1116bnode-registry-router-1   Ready     <none>    11m       v1.11.0+d4cacc0   10.0.151.80    <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   docker://1.13.1
[root@wj311osp1116bmaster-etcd-nfs-1 ~]# oc version 
oc v3.11.318
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://wj311osp1116bmaster-etcd-nfs-1:8443
openshift v3.11.318
kubernetes v1.11.0+d4cacc0

[root@wj311osp1116bmaster-etcd-nfs-1 ~]# cat /etc/NetworkManager/conf.d/99-origin.conf 

[main]
dns=none


#### before upgrade
[root@wj311osp1116bmaster-etcd-nfs-1 ~]# oc -n openshift-monitoring rsh cluster-monitoring-operator-576c6b8b55-sz8cw 
sh-4.2$ nslookup
> kubernetes.default.svc.cluster.local
Server:         10.0.151.160
Address:        10.0.151.160#53

Name:   kubernetes.default.svc.cluster.local
Address: 172.30.0.1
> 
sh-4.2$ exit

[root@wj311osp1116bnode-1 ~]# rpm -qa|grep -i -E "kernel|networkmanager|cloud-init|redhat-release-server"
cloud-init-18.5-6.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
kernel-tools-libs-3.10.0-1127.el7.x86_64
NetworkManager-1.18.4-3.el7.x86_64
NetworkManager-team-1.18.4-3.el7.x86_64
NetworkManager-tui-1.18.4-3.el7.x86_64
kernel-tools-3.10.0-1127.el7.x86_64
NetworkManager-config-server-1.18.4-3.el7.noarch
redhat-release-server-7.8-2.el7.x86_64
NetworkManager-libnm-1.18.4-3.el7.x86_64


#### After upgrade 
[root@wj311osp1116bnode-1 ~]# rpm -qa|grep -i -E "kernel|networkmanager|cloud-init|redhat-release-server" 
NetworkManager-libnm-1.18.8-2.el7_9.x86_64
kernel-3.10.0-1127.el7.x86_64
NetworkManager-team-1.18.8-2.el7_9.x86_64
NetworkManager-1.18.8-2.el7_9.x86_64
kernel-3.10.0-1160.6.1.el7.x86_64
cloud-init-19.4-7.el7_9.2.x86_64
NetworkManager-config-server-1.18.8-2.el7_9.noarch
redhat-release-server-7.9-5.el7_9.x86_64
kernel-tools-libs-3.10.0-1160.6.1.el7.x86_64
kernel-tools-3.10.0-1160.6.1.el7.x86_64
NetworkManager-tui-1.18.8-2.el7_9.x86_64


[root@wj311osp1116bmaster-etcd-nfs-1 ~]# oc get nodes -o wide 
NAME                                  STATUS    ROLES     AGE       VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
wj311osp1116bmaster-etcd-nfs-1        Ready     master    32m       v1.11.0+d4cacc0   10.0.150.215   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64       docker://1.13.1
wj311osp1116bnode-1                   Ready     compute   28m       v1.11.0+d4cacc0   10.0.151.160   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.6.1.el7.x86_64   docker://1.13.1
wj311osp1116bnode-registry-router-1   Ready     <none>    28m       v1.11.0+d4cacc0   10.0.151.80    <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64       docker://1.13.1

[root@wj311osp1116bmaster-etcd-nfs-1 ~]# oc -n openshift-monitoring rsh cluster-monitoring-operator-576c6b8b55-sz8cw 
sh-4.2$ nslookup 
> kubernetes.default.svc.cluster.local
Server:         10.0.151.160
Address:        10.0.151.160#53

Name:   kubernetes.default.svc.cluster.local
Address: 172.30.0.1
> 
sh-4.2$ exit

Comment 20 errata-xmlrpc 2020-11-18 14:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.318 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5107

