Bug 1748015

Summary: [cloud-init][RHEL7] /etc/resolv.conf lose config after reboot (initial instance is ok)
Product: Red Hat Enterprise Linux 7 Reporter: ggrimaux
Component: cloud-initAssignee: Eduardo Otubo <eterrell>
Status: CLOSED ERRATA QA Contact: xiachen
Severity: high Docs Contact:
Priority: high    
Version: 7.7CC: acruz, akaris, amasolov, astupnik, atragler, bgalvani, dhill, eterrell, fgiudici, gkadam, goeran, hasuzuki, hfukumot, huzhao, igortiunov, ivan.ayala, javier.leonperis, jgreguske, jnewbigin, lrintel, mboisver, mkalinin, mmorsy, mprchlik, msimka, mwinkler, pvlasin, rasrivas, ravsingh, ribarry, rkhan, smozowei, sukulkar, thaller, vpoliset, xiachen, xialiu, xiliang, yacao, ymao
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cloud-init-19.4-6.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1807797 (view as bug list) Environment:
Last Closed: 2020-09-29 19:48:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1744146, 1782469, 1797441, 1807797, 1840648, 1947476    
Attachments:
Description Flags
journalctl -u NetworkManager after 2nd reboot none

Description ggrimaux 2019-09-02 13:28:45 UTC
Description of problem:
On OpenStack I launched an instance of rhel-server-7.7-x86_64-kvm.qcow2 downloaded from our website.

After first boot I have the following in /etc/resolv.conf:
[root@test-gg-rhel77 ~]# cat /etc/resolv.conf 
; Created by cloud-init on instance boot automatically, do not edit.
;
# Generated by NetworkManager
nameserver 66.10.10.10
nameserver 33.10.10.10
nameserver 99.10.10.10
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.10.10.10

But on next reboot, all is lost.:

[root@test-gg-rhel77 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager

Although in /var/log/messages you see the nameservers being received by dhcp.
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8303] dhh
cp4 (eth0):   lease time 43200
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8303] dhh
cp4 (eth0):   nameserver '66.10.10.10'
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8304] dhh
cp4 (eth0):   nameserver '33.10.10.10'
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8304] dhh
cp4 (eth0):   nameserver '99.10.10.10'
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8304] dhh
cp4 (eth0):   nameserver '10.10.10.10'
Sep  2 09:18:53 test-gg-rhel77 NetworkManager[605]: <info>  [1567430333.8304] dhh
cp4 (eth0): state changed unknown -> bound

If I do the same operation on the same environment (OpenStack) but with rhel7.6 we have no problem.

I need help on finding where the problem is.

I have a testing environment for you if it helps you.


Version-Release number of selected component (if applicable):
rhel-7.7

How reproducible:
100%

Steps to Reproduce:
1. Spawn an instance of rhel7.7, upon first boot /etc/resolv.conf has the nameservers configured (from dhcp)
2. On next reboot it loses it, indefinately.
3.

Actual results:
Lost of /etc/resolv.conf config

Expected results:
Keep config after reboots.

Additional info:

Comment 2 Thomas Haller 2019-09-02 13:57:44 UTC
Could you enable level=TRACE logging and provide a logfile showing the issue after reboot?

See https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28 for how to do that.

Thank you.

Comment 3 ggrimaux 2019-09-04 10:52:06 UTC
Created attachment 1611450 [details]
journalctl -u NetworkManager after 2nd reboot

Including the debug log for NetworkManager after the reboot.

We should have the nameserver 11.11.11.11.

In the debug we see it being received.

But resolv.conf is empty as before:
[root@test-gg-rhel77 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
[root@test-gg-rhel77 ~]#

Let me know if you need anything else.

Thank yo u.

Comment 4 Thomas Haller 2019-09-04 11:33:45 UTC
In the log you see:

Sep 04 05:34:48 test-gg-rhel77 NetworkManager[605]: <debug> [1567589688.1065] CONFIG:   dns=none

which disables writing /etc/resolv.conf. See "main.dns" in `man NetworkManager.conf`.


This is probably caused by a file "/etc/NetworkManager/conf.d/99-cloud-init.conf"



Reassigning to cloud-init.

Comment 5 Göran Uddeborg 2019-09-04 14:11:33 UTC
I believe we can confirm the connection with cloud-init's NM configuration.  In parallel with your investigations, we have investigated and looked for a workaround.  What we do now is we add a file "9A-override-99-cloud-init.conf" containing

    [main]
    dns = default

Then we use the resulting image instead of the original rhel-server-7.7 and things seems to work again.

We still don't understand the reason for the "/etc/NetworkManager/conf.d/99-cloud-init.conf" in the first place.  The file doesn't belong to any package, according to RPM.

Comment 6 Rick Barry 2019-09-04 18:27:45 UTC
Eduardo, any ideas regarding "/etc/NetworkManager/conf.d/99-cloud-init.conf" in 7.7?

Also CC'ing Mohammed from the rhel-guest-image side.

Comment 10 ITD27M01 2019-09-24 07:21:54 UTC
Hi Guys

The reason of the "/etc/NetworkManager/conf.d/99-cloud-init.conf" is the following: [1].

The new version of cloud-init coming from rhel7.7 try to manage the nameserver records in /etc/resolv.conf when nameservers are defined in OpenStack metadata: [2], [3]

Initially the image does not have the "dns=none" configuration for NetworkManager and NM manages the "resolv.conf" file during the first startup of the OpenStack instance. The NetworkManager applies configuration from DHCP  (Strictly speaking, this behavior conflicts with cloud-init because cloud-init configures the resolv.conf from subnet metadata too) but _during_ _the_ _shutdown_, the NetworkManager revert back and clean the configuration comes from DHCP (I think it is normal behavior for going offline from DHCP configuration).

But cloud-init do configuration only one time during first instance boot from glance image. During the second boot, it does not change the resolv.conf file and this file is clean (reverted to clean state by NetworkManager during the first shutdown).

I saw the different behaviour for different networks. I checked the behavior another time with DEBUG logs enabled for cloud-init and NetworkManager in the vxlan tenant. I can see from the logs that NetworkManager started just after cloud-init (but at the same second) and with 'dns=none' configuration:

~~~~~~~~~~~~~~~~
Sep 20 11:34:52 network-manager-rtc cloud-init: 2019-09-20 15:34:51,930 - util.py[DEBUG]: Writing to /etc/resolv.conf - wb: [644] 117 bytes
Sep 20 11:34:52 network-manager-rtc cloud-init: 2019-09-20 15:34:51,931 - util.py[DEBUG]: Writing to /etc/NetworkManager/conf.d/99-cloud-init.conf - wb: [644] 89 bytes
Sep 20 11:34:52 network-manager-rtc NetworkManager[765]: <info>  [1568993692.1014] NetworkManager (version 1.18.0-5.el7) is starting... (for the first time)
Sep 20 11:34:52 network-manager-rtc NetworkManager[765]: <info>  [1568993692.1031] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 00-server.conf, 10-slaves-order.conf
) (etc: 99-cloud-init.conf)
Sep 20 11:34:52 network-manager-rtc NetworkManager[765]: <info>  [1568993692.1311] dns-mgr[0x55be7b9b1220]: init: dns=none,systemd-resolved rc-manager=unmanaged
~~~~~~~~~~~~~~~~

But at the same time the provider network have another behaviour. The cloud-init service started _after_ NetworkManager service and as a result the NetworkManager does not have the 'dns=none' configuration:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 20 12:01:37 network-manager systemd: Starting Network Manager...
Sep 20 12:01:37 network-manager NetworkManager[782]: <info>  [1568995297.1206] NetworkManager (version 1.18.0-5.el7) is starting... (for the first time)
Sep 20 12:01:37 network-manager NetworkManager[782]: <info>  [1568995297.1216] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 00-server.conf, 10-slaves-order.conf)
Sep 20 12:01:37 network-manager NetworkManager[782]: <info>  [1568995297.2167] dns-mgr[0x55fffe1ac220]: init: dns=default,systemd-resolved rc-manager=file
Sep 20 12:02:04 network-manager cloud-init: 2019-09-20 16:02:04,789 - util.py[DEBUG]: Writing to /etc/resolv.conf - wb: [644] 169 bytes
Sep 20 12:02:04 network-manager cloud-init: 2019-09-20 16:02:04,791 - util.py[DEBUG]: Writing to /etc/NetworkManager/conf.d/99-cloud-init.conf - wb: [644] 89 bytes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It looks like there some race condition between systemd services for cloud-init and NetworkManager

[1] https://bugs.launchpad.net/cloud-init/+bug/1693251
[2] https://github.com/cloud-init/cloud-init/blob/f57a77577dd708c7f57babf8cd63ec18134bf34a/cloudinit/net/sysconfig.py#L600
[3] https://github.com/cloud-init/cloud-init/blob/f57a77577dd708c7f57babf8cd63ec18134bf34a/cloudinit/net/sysconfig.py#L316

Comment 14 Göran Uddeborg 2019-10-03 06:50:56 UTC
The service file for cloud-init says "After=NetworkManager.service network.service".  Isn't it strange cloud-init starts before NetworkManager in your "vxlan tenant"?

Comment 15 ITD27M01 2019-10-03 07:16:17 UTC
The cloud-init package has four services for different stages of configuration:

~~~~~~~~~~~~~~~~~~~~~~
[root@srv8-xibuyin ~]# systemctl status cloud-* | grep .service
● cloud-init-local.service - Initial cloud-init job (pre-networking)
   Loaded: loaded (/usr/lib/systemd/system/cloud-init-local.service; enabled; vendor preset: disabled)
   CGroup: /system.slice/cloud-init-local.service
● cloud-init.service - Initial cloud-init job (metadata service crawler)
   Loaded: loaded (/usr/lib/systemd/system/cloud-init.service; enabled; vendor preset: disabled)
   CGroup: /system.slice/cloud-init.service
● cloud-config.service - Apply the settings specified in cloud-config
   Loaded: loaded (/usr/lib/systemd/system/cloud-config.service; enabled; vendor preset: disabled)
   CGroup: /system.slice/cloud-config.service
● cloud-final.service - Execute cloud user/final scripts
   Loaded: loaded (/usr/lib/systemd/system/cloud-final.service; enabled; vendor preset: disabled)
   CGroup: /system.slice/cloud-final.service
~~~~~~~~~~~~~~~~~~~~~~

Where is network configuration (by cloud-init) start?

Comment 17 ITD27M01 2019-10-11 10:09:35 UTC
In our network we have the latency for metadata service for some reasons. And the http://169.254.169.254/ URL can response up to 10 seconds in time. So, the race condition may relate to this. And it is a reason the issue is not reproducible in devstack.

Comment 31 Huijuan Zhao 2019-12-26 09:07:03 UTC
I reproduced the issue with rhel-7.7(kernel-3.10.0-1062.el7.x86_64) + cloud-init-18.5-3.el7.x86_64.

Eduardo, I will send the ENV info to you via email, you can access it to debug or test if needed.

Additional info:
No such issue in rhel-7.8(kernel-3.10.0-1111.el7.x86_64) + cloud-init-18.5-6.el7.x86_64.

Comment 33 Eduardo Otubo 2020-01-09 13:44:06 UTC
For all customer cases above: The reason why the behaviour changed from rhel-7.6 to rhel-7.7 might be other than just cloud-init package. The difference between those two (as reported above, cloud-init-18.2-1.el7.x86_64 for rhel-7.6 and cloud-init-18.5-3.el7.x86_64 for rhel-7.7) is the support for Fast Provisioning on Azure, nothing else. It might be some difference on default behaviours on different versions on NetworkManager, perhaps.

One possible explanation that it works on rhel-7.8 / cloud-init-18.5-6.el7.x86_64 is the addition of:

commit 6954486d1318a3d1aa5bbcfcb119bcfcb73618bf
Author: Eduardo Otubo <otubo>
Date:   Fri Sep 6 12:12:11 2019 +0200

    Fix for network configuration not persisting after reboot

Which is a fix for BZ#1593010

So far we have the current scenario:

* RHEL-7.6: WORKS
* RHEL-7.7: broken
* RHEL-7.8: WORKS

Workaround for RHEL-7.7: If the only issue observed is the missing configuration on resolv.conf (not the entire network configuration) than I advise to manually configure the DNS options on cloud-init, as described on this documentation: https://cloudinit.readthedocs.io/en/latest/topics/examples.html#configure-an-instances-resolv-conf If you have a problem on the whole network configuration missing, than we'll probably want to backport the above commit to the 7.7 branch and give it a try - keep in mind this is not a certain fix, as I mentioned, behaviours from other packages (like NetworkManager) could also be affecting this.

Setting back to NEW as no work needs to be done so far.

Comment 38 Andreas Karis 2020-01-19 13:47:40 UTC
Hi,

So whereas I cannot reproduce the issue reported in this bug, I was troubleshooting the logic behind the creation of file /etc/NetworkManager/conf.d/99-cloud-init.conf  and I found what triggers the creation of file /etc/NetworkManager/conf.d/99-cloud-init.conf in both RHEL 7.6 and RHEL 7.7

A priori, the file should get created when the subnet is configured with dns_nameservers, e.g.:
~~~
(overcloud) [stack@undercloud-0 ~]$ neutron subnet-show private1-subnet
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+-------------------+----------------------------------------------------+
| Field             | Value                                              |
+-------------------+----------------------------------------------------+
| allocation_pools  | {"start": "192.168.0.100", "end": "192.168.0.150"} |
| cidr              | 192.168.0.0/24                                     |
| created_at        | 2020-01-16T12:24:12Z                               |
| description       |                                                    |
| dns_nameservers   | 10.11.5.4                                          | <------
|                   | 10.11.5.3                                          | <------
| enable_dhcp       | True                                               |
| gateway_ip        | 192.168.0.1                                        |
| host_routes       |                                                    |
| id                | 1e44d2df-1acb-4d81-8a5e-3ff22ae7224f               |
| ip_version        | 4                                                  |
| ipv6_address_mode |                                                    |
| ipv6_ra_mode      |                                                    |
| name              | private1-subnet                                    |
| network_id        | 7e1c33da-db7d-4003-b962-584e93f7f470               |
| project_id        | 34d6b951c06e4d65ba23e669761d0165                   |
| revision_number   | 1                                                  |
| segment_id        |                                                    |
| service_types     |                                                    |
| subnetpool_id     |                                                    |
| tags              |                                                    |
| tenant_id         | 34d6b951c06e4d65ba23e669761d0165                   |
| updated_at        | 2020-01-19T13:26:25Z                               |
+-------------------+----------------------------------------------------+
~~~

However, in RHEL 7.6, this is not the case, *unless* the instance is booted with a config-drive. 

Here are the combinations that lead to file creation or not:

---------------------------------------------------------------

### RHEL 7.6 ###

i)
* if no DNS nameservers on subnet
-> No file /etc/NetworkManager/conf.d/99-cloud-init.conf

ii)
~~~
# * if DNS nameservers on subnet and no config drive
nova boot --nic net-id=$NETID --image rhel-76 --flavor m1.medium --key-name id_rsa rhel-76-test
~~~
-> No file /etc/NetworkManager/conf.d/99-cloud-init.conf:

iii)
~~~
# * if DNS nameservers on subnet and config drive
nova boot --nic net-id=$NETID --image rhel-76 --flavor m1.medium --key-name id_rsa --config-drive true rhel-76-test
~~~
-> File /etc/NetworkManager/conf.d/99-cloud-init.conf is created

### RHEL 7.7: ###


a) 
* if no DNS nameservers on subnet
-> No file /etc/NetworkManager/conf.d/99-cloud-init.conf

b)
~~~
# * if DNS nameservers on subnet and no config drive
nova boot --nic net-id=$NETID --image TID-RH77 --flavor m1.medium --key-name id_rsa TID-RH77-test
~~~
-> File /etc/NetworkManager/conf.d/99-cloud-init.conf is created:

c)
~~~
# * if DNS nameservers on subnet and config drive
nova boot --nic net-id=$NETID --image TID-RH77 --flavor m1.medium --key-name id_rsa --config-drive true TID-RH77-test
~~~
-> File /etc/NetworkManager/conf.d/99-cloud-init.conf is created:

---------------------------------------------------------------

That comes from:
https://bugs.launchpad.net/cloud-init/+bug/1693251
https://git.launchpad.net/~rmccabe/cloud-init/commit/?id=7a49497502aad683297f36666c4d025a66330cc4

- Andreas

Comment 39 Huijuan Zhao 2020-01-20 10:33:45 UTC
I tested with rhel-7.7(3.10.0-1058.el7.x86_64) + cloud-init-18.5-6.el7.x86_64, no such issue.


Test steps:
1. Launch rhel-7.7 instance on OpenStack
2. Check the cloud-init version is cloud-init-18.5-3.el7.x86_64
Check the /etc/resolv.conf:
$ cat /etc/resolv.conf 
; Created by cloud-init on instance boot automatically, do not edit.
;
# Generated by NetworkManager
search openstacklocal
nameserver 10.11.x.x
nameserver 10.5.x.x
3. Uninstall cloud-init-18.5-3.el7.x86_64, install cloud-init-18.5-6.el7.x86_64
4. Reboot instance
5. Login instance and check the cloud-init version is cloud-init-18.5-6.el7.x86_64
check the /etc/resolv.conf:
$ cat /etc/resolv.conf 
; Created by cloud-init on instance boot automatically, do not edit.
;
# Generated by NetworkManager
nameserver 10.11.5.19
nameserver 10.5.30.45


So for RHEL-7.7:
1. rhel-7.7 + cloud-init-18.5-3.el7.x86_64, broken
2. rhel-7.7 + cloud-init-18.5-6.el7.x86_64, WORKS

Comment 40 Eduardo Otubo 2020-01-20 11:23:37 UTC
(In reply to Huijuan Zhao from comment #39)
> I tested with rhel-7.7(3.10.0-1058.el7.x86_64) +
> cloud-init-18.5-6.el7.x86_64, no such issue.
> 
> 
> Test steps:
> 1. Launch rhel-7.7 instance on OpenStack
> 2. Check the cloud-init version is cloud-init-18.5-3.el7.x86_64
> Check the /etc/resolv.conf:
> $ cat /etc/resolv.conf 
> ; Created by cloud-init on instance boot automatically, do not edit.
> ;
> # Generated by NetworkManager
> search openstacklocal
> nameserver 10.11.x.x
> nameserver 10.5.x.x
> 3. Uninstall cloud-init-18.5-3.el7.x86_64, install
> cloud-init-18.5-6.el7.x86_64
> 4. Reboot instance
> 5. Login instance and check the cloud-init version is
> cloud-init-18.5-6.el7.x86_64
> check the /etc/resolv.conf:
> $ cat /etc/resolv.conf 
> ; Created by cloud-init on instance boot automatically, do not edit.
> ;
> # Generated by NetworkManager
> nameserver 10.11.5.19
> nameserver 10.5.30.45
> 
> 
> So for RHEL-7.7:
> 1. rhel-7.7 + cloud-init-18.5-3.el7.x86_64, broken
> 2. rhel-7.7 + cloud-init-18.5-6.el7.x86_64, WORKS

Can you confirm the *only* scenario that this bug occurs is rhel-7.7 + cloud-init-18.5-3.el7.x86_64? In such case, we can simply advise the customer to update the package?

Comment 41 Rick Barry 2020-01-20 16:21:11 UTC
Posting a comment from email by Vimal Patel:

"I found this issue on RHEL 7.8 snap3.1, and I retested on 7.7 RC (no package updating) and the issue existed on both."

So, it appears the issue exists on RHEL 7.7 and 7.8, although for some reason there's a discrepancy between Vimal's testing and Huijuan's testing. If this is a race condition, then it's not surprising that test environments could see different results. We'll continue the investigation in rhel-7.9 and request z-stream fixes for 7.7 and 7.8 once we sort out a solution.

Comment 66 Eduardo Otubo 2020-05-04 12:54:17 UTC
*** Bug 1829713 has been marked as a duplicate of this bug. ***

Comment 82 errata-xmlrpc 2020-09-29 19:48:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: cloud-init security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3898

Comment 83 Ivan Ayala 2020-11-12 20:59:11 UTC
During our patching we got this error:



Redirecting to /bin/systemctl status cloud-final.service
● cloud-final.service - Execute cloud user/final scripts
   Loaded: loaded (/usr/lib/systemd/system/cloud-final.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-11-13 03:54:23 CST; 54s ago
  Process: 1422 ExecStartPost=/usr/bin/systemctl reload-or-try-restart NetworkManager.service (code=exited, status=5)
  Process: 1414 ExecStartPost=/bin/echo try restart NetworkManager.service (code=exited, status=0/SUCCESS)
  Process: 1276 ExecStart=/usr/bin/cloud-init modules --mode=final (code=exited, status=0/SUCCESS



We updated from cloud-init-18.5-6.el7.centos.5.x86_64 to Update 19.4-7.el7.centos.x86_64


We did the rollback and this fixed our issue.

Fix:

yum downgrade https://rpmfind.net/linux/centos/7.8.2003/updates/x86_64/Packages/cloud-init-18.5-6.el7.centos.5.x86_64.rpm

systemctl restart cloud-final

Comment 84 xiachen 2021-04-12 06:51:45 UTC
Hi Ivan,
Could you provide cloud-init.log and journalctl -u cloud-final.service please?
then we can check the error details.

thanks
Amy

(In reply to Ivan Ayala from comment #83)
> During our patching we got this error:
> 
> 
> 
> Redirecting to /bin/systemctl status cloud-final.service
> ● cloud-final.service - Execute cloud user/final scripts
>    Loaded: loaded (/usr/lib/systemd/system/cloud-final.service; enabled;
> vendor preset: disabled)
>    Active: failed (Result: exit-code) since Fri 2020-11-13 03:54:23 CST; 54s
> ago
>   Process: 1422 ExecStartPost=/usr/bin/systemctl reload-or-try-restart
> NetworkManager.service (code=exited, status=5)
>   Process: 1414 ExecStartPost=/bin/echo try restart NetworkManager.service
> (code=exited, status=0/SUCCESS)
>   Process: 1276 ExecStart=/usr/bin/cloud-init modules --mode=final
> (code=exited, status=0/SUCCESS
> 
> 
> 
> We updated from cloud-init-18.5-6.el7.centos.5.x86_64 to Update
> 19.4-7.el7.centos.x86_64
> 
> 
> We did the rollback and this fixed our issue.
> 
> Fix:
> 
> yum downgrade
> https://rpmfind.net/linux/centos/7.8.2003/updates/x86_64/Packages/cloud-init-
> 18.5-6.el7.centos.5.x86_64.rpm
> 
> systemctl restart cloud-final