Bug 2013010
| Summary: | NetworkManager not updating DNS in /etc/resolv.conf when using DHCP | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Robert McSwain <rmcswain> | ||||
| Component: | cloud-init | Assignee: | Virtualization Maintenance <virt-maint> | ||||
| Status: | NEW --- | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 7.9 | CC: | bdas, bgalvani, eesposit, eterrell, huzhao, jgreguske, lrintel, rkhan, sfaye, shaselde, sukulkar, thaller, till, usurse, xiachen, xiliang, yacao | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | Type: | Bug | |||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
Description
Robert McSwain
2021-10-11 19:59:58 UTC
While trying to reproduce this in AWS, a few points are unclear to me.

I created an image with the attached cloud.cfg file, but I cannot access the instances because eth0 failed to come up. Is the customer's system accessible with this configuration applied?

Nov 15 08:12:41 ip-10-116-2-42 network: Bringing up loopback interface: [ OK ]
Nov 15 08:12:41 ip-10-116-2-42 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Nov 15 08:12:41 ip-10-116-2-42 NetworkManager[592]: <info> [1636963961.4000] agent-manager: req[0x56034a49aca0, :1.18/nmcli-connect/0]: agent registered
Nov 15 08:12:41 ip-10-116-2-42 NetworkManager[592]: <info> [1636963961.4008] audit: op="connection-activate" uuid="5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03" name="System eth0" result="fail" reason="No suitable device found for this connection (device eth0 not available because profile is not compatible with device (permanent MAC address doesn't match))."
Nov 15 08:12:41 ip-10-116-2-42 network: Bringing up interface eth0: Error: Connection activation failed: No suitable device found for this connection (device eth0 not available because profile is not compatible with device (permanent MAC address doesn't match)).
Nov 15 08:12:41 ip-10-116-2-42 network: [FAILED]
Nov 15 08:12:41 ip-10-116-2-42 systemd: network.service: control process exited, code=exited status=1
Nov 15 08:12:41 ip-10-116-2-42 systemd: Failed to start LSB: Bring up/down networking.
Nov 15 08:12:41 ip-10-116-2-42 systemd: Unit network.service entered failed state.
Nov 15 08:12:41 ip-10-116-2-42 systemd: network.service failed.

What is the expected content of '/etc/resolv.conf'? I checked the two sos reports (cloud-init-19.4-7.el7_9.4.x86_64), one from VMware and one from AWS. Both have similar content generated by NM.
# cat etc/resolv.conf
# Generated by NetworkManager
search int.cdphp.com
nameserver 10.201.32.50
nameserver 10.201.16.54
nameserver 10.100.4.112
nameserver 10.100.4.114
options timeout:2 attempts:1

After checking bz1748015, I guess this is not the same issue.

I first replaced cloud.cfg with the attached one on a running instance; resolv.conf was restored without any problem after reboot.

[root@ip-10-116-2-42 ec2-user]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.116.0.2
[root@ip-10-116-2-42 ec2-user]# truncate -s0 /etc/resolv.conf
[root@ip-10-116-2-42 ec2-user]# cloud-init clean
[root@ip-10-116-2-42 ec2-user]# reboot
Connection to ec2-34-209-65-192.us-west-2.compute.amazonaws.com closed by remote host.
Connection to ec2-34-209-65-192.us-west-2.compute.amazonaws.com closed.
[root@cloud-aws-2 fedora]# ssh ec2-user.compute.amazonaws.com
Last login: Mon Nov 15 09:02:58 2021 from 66.187.232.127
[root@ip-10-116-2-42 ec2-user]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.116.0.2

Comment 0 indicated that downgrading cloud-init made it work. If that is correct, then cloud-init seems to be involved.

It's not clear to me which log to look at. Looking at "after.tar.gz" and the related sosreport from https://access.redhat.com/support/cases/#/case/03044228/discussion?commentId=a0a6R00000U2yPzQAJ , we see that NetworkManager does not actually manage eth0:

<info> [1668607054.6799] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-eth0 (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03,"System eth0") due to NM_CONTROLLED=no. Unmanaged: interface-name:eth0.

We see that resolv.conf is empty, and that it contains a comment saying it was written by NetworkManager. If there are no devices/profiles active in NetworkManager, NetworkManager has no DNS configuration. That is probably why it wrote an empty resolv.conf. It's questionable whether that makes sense.
I seem to remember that we made a change so that NetworkManager doesn't write an empty /etc/resolv.conf if NetworkManager has no configuration. Beniamino, do you recall that?

In any case, if the user clearly does not want NetworkManager to handle /etc/resolv.conf (or anything at all), they can tell NetworkManager not to touch /etc/resolv.conf (via `[main].dns=none` or `[main].rc-manager=unmanaged`, see `man NetworkManager.conf`). Or, if NetworkManager doesn't do anything useful on that system anyway and they have their own tool (network-scripts?), it's better to just `systemctl disable NetworkManager`, or even `yum remove NetworkManager`.

But the issue is rather unclear to me and I don't see relevant logs.

(In reply to Thomas Haller from comment #20)
> If there are no devices/profiles active in NetworkManager, NetworkManager
> has no DNS configuration. That is probably why it wrote an empty
> resolv.conf. It's questionable whether that makes sense. I seem to remember,
> we did a change that NetworkManager doesn't write an empty /etc/resolv.conf,
> if NetworkManager has no configuration.

Right, that is what should happen because of the way in which we compute the hash of the current DNS configuration. Also, see bug 1344303.

As Thomas said, to understand whether the issue is related to NM (which doesn't seem to be the case, since a cloud-init downgrade fixes the problem), we need NM logs at trace level, with both cloud-init versions.

Hi, as said in the previous comments, we need NetworkManager logs at trace level to understand how cloud-init is configuring NM. Are those logs available anywhere?

To change the log level, set level=TRACE in the [logging] section of /etc/NetworkManager/NetworkManager.conf, then reboot, reproduce the issue, and attach the output of `journalctl -b`. If possible, repeat this procedure with both cloud-init versions. Thank you.

The difference between the two logs is in the behavior of cloud-init between 18.5-6 and 19.4-7.
The latter performs a "systemctl reload NetworkManager" after starting up, while the former does not:

cloud-init: Cloud-init v. 18.5 running 'modules:final'

vs

cloud-init: Cloud-init v. 19.4 running 'modules:final'
NetworkManager: <info> [1688650541.9370] audit: op="reload" arg="0" pid=1707 uid=0 result="success"

In RHEL 7, the NM reload is implemented in the systemd service file as calling the Reload() D-Bus method with flag 0:

ExecReload=/usr/bin/dbus-send --print-reply --system --type=method_call --dest=org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager.Reload uint32:0

where the flags are defined as:

No flags (0x00) means to reload everything that is supported which is identical to sending a SIGHUP.
(0x01) means to reload the NetworkManager.conf configuration from disk. Note that this does not include connections, which can be reloaded via Setting's ReloadConnections.
(0x02) means to update DNS configuration, which usually involves writing /etc/resolv.conf anew.
(0x04) means to restart the DNS plugin.

So, calling reload with flag 0 also requests an explicit update of resolv.conf. Since there are no DNS servers configured in NM for eth0, resolv.conf is overwritten with an empty list.

I think the solution here is that cloud-init should either:

1) set "dns=none" in NetworkManager configuration if NetworkManager is not supposed to touch resolv.conf, or
2) if the purpose of the reload at the end is to make NM aware of the new configuration files in /etc/NetworkManager/conf.d, send a reload with flag 1, so that resolv.conf is not forcibly rewritten.

I believe NetworkManager is behaving as documented and expected here; the cause of the bug seems to be a change in cloud-init between 18.5-6 and 19.4-7. Therefore, I'm reassigning the bz; please reassign it back if you find anything needs to be changed in NM. Thank you.
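
The two options proposed in the analysis above can be sketched in Python. This is only an illustration under stated assumptions, not what cloud-init or NetworkManager actually ship: the names `ReloadFlag`, `effective_actions`, `rewrites_resolv_conf`, `dbus_send_reload` and `dns_none_dropin` are hypothetical, and the flag values and `dbus-send` arguments are copied from the comment above. The sketch models why `Reload(0)` implies a resolv.conf update while `Reload(1)` does not, and renders a `[main]` section with `dns=none` as text (writing it to a file under /etc/NetworkManager/conf.d is left to the reader).

```python
import configparser
import io
from enum import IntFlag


class ReloadFlag(IntFlag):
    """Flag bits of the Reload() D-Bus method as quoted in this bug (names are illustrative)."""
    CONF = 0x01      # reload NetworkManager.conf from disk
    DNS_RC = 0x02    # update DNS configuration (usually rewrites /etc/resolv.conf)
    DNS_FULL = 0x04  # restart the DNS plugin


ALL = ReloadFlag.CONF | ReloadFlag.DNS_RC | ReloadFlag.DNS_FULL


def effective_actions(flags):
    """Return the actions requested by Reload(flags); 0 means 'reload everything'."""
    if flags == 0:
        return ALL
    return ReloadFlag(flags & int(ALL))


def rewrites_resolv_conf(flags):
    """True if the call explicitly requests a DNS configuration update."""
    return bool(effective_actions(flags) & ReloadFlag.DNS_RC)


def dbus_send_reload(flags):
    """Build the dbus-send argument list from the ExecReload line, with a chosen flag."""
    return [
        "/usr/bin/dbus-send", "--print-reply", "--system", "--type=method_call",
        "--dest=org.freedesktop.NetworkManager",
        "/org/freedesktop/NetworkManager",
        "org.freedesktop.NetworkManager.Reload",
        "uint32:%d" % flags,
    ]


def dns_none_dropin():
    """Render a [main] section with dns=none as configuration text (option 1)."""
    parser = configparser.ConfigParser()
    parser["main"] = {"dns": "none"}
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    for flags in (0, 1, 2):
        print(flags, rewrites_resolv_conf(flags))
    print(" ".join(dbus_send_reload(1)))
    print(dns_none_dropin())
```

Under this model, only the flag-0 and flag-2 calls request a resolv.conf update, which matches the reasoning that a configuration-only reload (flag 1) would leave the file alone.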