Bug 2013010

Summary: NetworkManager not updating DNS in /etc/resolv.conf when using DHCP
Product: Red Hat Enterprise Linux 7
Component: cloud-init
Version: 7.9
Hardware: x86_64
OS: Linux
Status: NEW
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Robert McSwain <rmcswain>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Virtualization Bugs <virt-bugs>
CC: bdas, bgalvani, eesposit, eterrell, huzhao, jgreguske, lrintel, rkhan, sfaye, shaselde, sukulkar, thaller, till, usurse, xiachen, xiliang, yacao
Type: Bug
Attachments: Customer's cloud.cfg file (flags: none)

Description Robert McSwain 2021-10-11 19:59:58 UTC
Created attachment 1831945
Customer's cloud.cfg file

Description of problem:
When using the latest version of cloud-init, cloud-init-19.4-7.el7_9.5.x86_64, DNS is not getting updated in /etc/resolv.conf.  I found this KB article, https://access.redhat.com/solutions/4437221, which references this bug https://bugzilla.redhat.com/show_bug.cgi?id=1748015. 

This works on VMware with cloud-init-19.4-7.el7_9.5.x86_64 but not on AWS.

Version-Release number of selected component (if applicable):
cloud-init-19.4-7.el7_9.5.x86_64

Updated from cloud-init-18.5-6.el7 to cloud-init-19.4-7.el7_9.5.x86_64

Steps to Reproduce:
1. Upgrade to cloud-init-19.4-7.el7_9.5.x86_64
2. Attempt to use the cloud.cfg file attached
3. DNS will not be updated in resolv.conf

Actual results:
DNS will not be updated in resolv.conf

Expected results:
DNS will be updated in resolv.conf

Additional info:

1. DNS should be set by DHCP, but it's not.
2. Simply downgrading cloud-init to version 18.5-6.el7 is enough for it to work properly again on AWS releases.

Comment 6 Frank Liang 2021-11-15 13:09:09 UTC
While trying to reproduce this on AWS, a few points are unclear to me.

I created an image with the attached cloud.cfg file, but I cannot access the resulting instances because eth0 fails to come up.
Is the customer's system accessible with this configuration applied?

Nov 15 08:12:41 ip-10-116-2-42 network: Bringing up loopback interface:  [  OK  ]
Nov 15 08:12:41 ip-10-116-2-42 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Nov 15 08:12:41 ip-10-116-2-42 NetworkManager[592]: <info>  [1636963961.4000] agent-manager: req[0x56034a49aca0, :1.18/nmcli-connect/0]: agent registered
Nov 15 08:12:41 ip-10-116-2-42 NetworkManager[592]: <info>  [1636963961.4008] audit: op="connection-activate" uuid="5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03" name="System eth0" result="fail" reason="No suitable device found for this connection (device eth0 not available because profile is not compatible with device (permanent MAC address doesn't match))."
Nov 15 08:12:41 ip-10-116-2-42 network: Bringing up interface eth0:  Error: Connection activation failed: No suitable device found for this connection (device eth0 not available because profile is not compatible with device (permanent MAC address doesn't match)).
Nov 15 08:12:41 ip-10-116-2-42 network: [FAILED]
Nov 15 08:12:41 ip-10-116-2-42 systemd: network.service: control process exited, code=exited status=1
Nov 15 08:12:41 ip-10-116-2-42 systemd: Failed to start LSB: Bring up/down networking.
Nov 15 08:12:41 ip-10-116-2-42 systemd: Unit network.service entered failed state.
Nov 15 08:12:41 ip-10-116-2-42 systemd: network.service failed.


What is the expected content of '/etc/resolv.conf'?
I checked the two sos reports (cloud-init-19.4-7.el7_9.4.x86_64), one from VMware and one from AWS. Both contain similar content generated by NM:

# cat etc/resolv.conf 
# Generated by NetworkManager
search int.cdphp.com
nameserver 10.201.32.50
nameserver 10.201.16.54
nameserver 10.100.4.112
nameserver 10.100.4.114
options timeout:2 attempts:1

After checking bz1748015, I think this is not the same issue. I first replaced cloud.cfg with the attached one in a running instance; resolv.conf is restored without any problem after a reboot:
 
[root@ip-10-116-2-42 ec2-user]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.116.0.2
[root@ip-10-116-2-42 ec2-user]# truncate -s0 /etc/resolv.conf
[root@ip-10-116-2-42 ec2-user]# cloud-init clean
[root@ip-10-116-2-42 ec2-user]# reboot
Connection to ec2-34-209-65-192.us-west-2.compute.amazonaws.com closed by remote host.
Connection to ec2-34-209-65-192.us-west-2.compute.amazonaws.com closed.
[root@cloud-aws-2 fedora]# ssh ec2-user.compute.amazonaws.com
Last login: Mon Nov 15 09:02:58 2021 from 66.187.232.127
[root@ip-10-116-2-42 ec2-user]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.116.0.2

Comment 20 Thomas Haller 2023-05-19 16:05:22 UTC
Comment 0 indicates that downgrading cloud-init made it work. If that is correct, then cloud-init seems to be involved.

It's not clear to me which log to look at. Looking at "after.tar.gz" and the related sosreport from https://access.redhat.com/support/cases/#/case/03044228/discussion?commentId=a0a6R00000U2yPzQAJ , we see that NetworkManager does not actually manage eth0:

   <info>  [1668607054.6799] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03,"System eth0") due to NM_CONTROLLED=no. Unmanaged: interface-name:eth0.


We see that resolv.conf is empty, and there is a comment that it was written by NetworkManager.
If there are no devices/profiles active in NetworkManager, NetworkManager has no DNS configuration. That is probably why it wrote an empty resolv.conf. It's questionable whether that makes sense. I seem to remember, we did a change that NetworkManager doesn't write an empty /etc/resolv.conf, if NetworkManager has no configuration. Beniamino, do you recall that?


In any case, if the user clearly does not want NetworkManager to handle /etc/resolv.conf (or anything at all), they can tell NetworkManager not to touch /etc/resolv.conf (via `[main].dns=none` or `[main].rc-manager=unmanaged`; see `man NetworkManager.conf`). Alternatively, if NetworkManager doesn't do anything useful on that system anyway and they have their own tool (network-scripts?), then it's better to just `systemctl disable NetworkManager`, or even `yum remove NetworkManager`.
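
As a rough illustration of the first option, the sketch below writes a drop-in file that sets `dns=none` in the `[main]` section. The function name `write_nm_dns_none`, the file name `90-dns-none.conf`, and the choice of a drop-in under /etc/NetworkManager/conf.d (rather than editing NetworkManager.conf directly) are assumptions made for this example and do not come from the customer's system.

```shell
#!/bin/sh
# Hypothetical helper: write a NetworkManager drop-in that tells NM
# not to manage /etc/resolv.conf. The file name is an arbitrary choice.
write_nm_dns_none() {
    # Target directory; defaults to the standard drop-in location.
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir" || return 1
    printf '[main]\ndns=none\n' > "$conf_dir/90-dns-none.conf"
}
```

Called as root without arguments and followed by a NetworkManager restart, this should make NM leave resolv.conf alone. A file of the same shape could carry `rc-manager=unmanaged` instead.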


But the issue is rather unclear to me and I don't see relevant logs.

Comment 22 Beniamino Galvani 2023-05-24 08:07:23 UTC
(In reply to Thomas Haller from comment #20)
> If there are no devices/profiles active in NetworkManager, NetworkManager
> has no DNS configuration. That is probably why it wrote an empty
> resolv.conf. It's questionable whether that makes sense. I seem to remember,
> we did a change that NetworkManager doesn't write an empty /etc/resolv.conf,
> if NetworkManager has no configuration.

Right, that is what should happen because of the way in which we compute the hash of current DNS configuration. Also, see bug 1344303.

As Thomas said, to understand whether the issue is related to NM (which doesn't seem to be the case, since a cloud-init downgrade fixes the problem), we need NM logs at trace level, with both cloud-init versions.

Comment 24 Beniamino Galvani 2023-06-13 08:47:44 UTC
Hi, as said in the previous comments, we need NetworkManager logs at trace level to understand how cloud-init is configuring NM. Are those logs available anywhere?

To change the log level, set level=TRACE in the [logging] section of /etc/NetworkManager/NetworkManager.conf, then reboot, reproduce the issue and attach the output of `journalctl -b`. If possible, repeat this procedure with both cloud-init versions. Thank you.
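
A minimal sketch of the logging change, assuming a drop-in file under /etc/NetworkManager/conf.d is acceptable in place of editing NetworkManager.conf itself. The function name `write_nm_trace_logging` and the file name `95-trace-logging.conf` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: enable TRACE logging through a drop-in file
# containing the [logging] section described above.
write_nm_trace_logging() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    mkdir -p "$conf_dir" || return 1
    printf '[logging]\nlevel=TRACE\n' > "$conf_dir/95-trace-logging.conf"
}

# After rebooting and reproducing the issue, save the log of the current boot:
#   journalctl -b > nm-trace.log
```

The drop-in can be deleted once the logs have been collected, which restores the previous log level on the next restart.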

Comment 27 Beniamino Galvani 2023-07-07 12:09:20 UTC
The difference between the two logs is in the behavior of cloud-init between 18.5-6 and 19.4-7. The latter performs a "systemctl reload NetworkManager" after starting up, while the former does not:

  cloud-init: Cloud-init v. 18.5 running 'modules:final'

  vs

  cloud-init: Cloud-init v. 19.4 running 'modules:final'
  NetworkManager: <info>  [1688650541.9370] audit: op="reload" arg="0" pid=1707 uid=0 result="success"

In RHEL 7, the NM reload is implemented in the systemd service file as calling the Reload() D-Bus method with flag 0:

  ExecReload=/usr/bin/dbus-send --print-reply --system --type=method_call --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.Reload uint32:0

where the flags are defined as:

  No flags (0x00) means to reload everything that is supported which is identical to sending a SIGHUP.
  (0x01) means to reload the NetworkManager.conf configuration from disk. Note that this does not include connections, which can be reloaded via Setting's ReloadConnections.
  (0x02) means to update DNS configuration, which usually involves writing /etc/resolv.conf anew.
  (0x04) means to restart the DNS plugin.

So, calling reload with flag 0 also requests an explicit update of resolv.conf. Since there are no DNS servers configured in NM for eth0, resolv.conf is overwritten with an empty list.
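
To make the flag semantics concrete, here is a small sketch that decodes a Reload() flags value according to the definitions quoted above. The function name `describe_reload_flags` is hypothetical, and the sketch only interprets the number; it does not talk to NetworkManager.

```shell
#!/bin/sh
# Hypothetical helper: list what a Reload() flags value asks NM to do,
# following the flag definitions quoted above.
describe_reload_flags() {
    flags=$(( $1 ))
    if [ "$flags" -eq 0 ]; then
        # No flags set: everything is reloaded, so resolv.conf is rewritten too.
        echo "reload everything supported (same as SIGHUP), including the DNS update"
        return 0
    fi
    [ $(( flags & 1 )) -ne 0 ] && echo "reload NetworkManager.conf from disk"
    [ $(( flags & 2 )) -ne 0 ] && echo "update DNS configuration (rewrite resolv.conf)"
    [ $(( flags & 4 )) -ne 0 ] && echo "restart the DNS plugin"
    return 0
}
```

`describe_reload_flags 0` reports the full reload that the RHEL 7 service file requests, while `describe_reload_flags 1` reports only the configuration reload.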

I think the solution here is that cloud-init should either:

 1) set "dns=none" in NetworkManager configuration if NetworkManager is not supposed to touch resolv.conf

 or

 2) if the purpose of the reload at the end is to make NM aware of the new configuration files in /etc/NetworkManager/conf.d, then cloud-init should send a reload with flag 1, so that resolv.conf is not forcibly rewritten
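
Option 2 could look roughly like the following, which reuses the ExecReload command from the service file with the flags argument changed from 0 to 1. The wrapper function `nm_reload_command` is hypothetical, and how cloud-init would actually issue the call is not specified in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the dbus-send command for Reload() with the
# given flags value (default 1 = reload NetworkManager.conf only).
nm_reload_command() {
    flags="${1:-1}"
    printf '%s\n' "/usr/bin/dbus-send --print-reply --system --type=method_call --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.Reload uint32:${flags}"
}
```

Running `sh -c "$(nm_reload_command 1)"` as root would then ask NM to re-read its configuration without explicitly requesting a resolv.conf rewrite.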

Comment 28 Beniamino Galvani 2023-07-10 08:00:36 UTC
I believe NetworkManager is behaving as documented and expected here; the cause of the bug seems to be a change in cloud-init between 18.5-6 and 19.4-7. Therefore, I'm reassigning the bz; please reassign it back if you find anything needs to be changed in NM. Thank you.