Bug 1903146

Summary: After minor upgrade cloud-init started to erase /etc/resolv.conf contents on overcloud nodes after reboot
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-common
Assignee: Rabi Mishra <ramishra>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: high
Priority: high
Version: 13.0 (Queens)
CC: ahyder, ariveral, aschultz, coldford, drosenfe, ealcaniz, ggrimaux, jhajyahy, jpretori, ldenny, mburns, mircea.vutcovici, nchandek, pweeks, ramishra, redhat, satmakur, slinaber, spower
Target Milestone: async
Keywords: TestOnly, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: All
OS: All
Fixed In Version: openstack-tripleo-common-8.7.1-28.el7ost, openstack-tripleo-heat-templates-8.4.1-86.el7ost
Last Closed: 2021-08-02 13:36:19 UTC
Type: Bug

Description Alex Stupnikov 2020-12-01 13:23:06 UTC
Description of problem:

I am not sure how to identify the affected component properly: the closest one is os-net-config, but it has nothing to do with the reported problem. So I selected openstack-tripleo and hope that it will be re-routed properly.

A fix for bug #1748015 set a dependency between NetworkManager.service and cloud-final.service to address a problem with OpenStack guest instances where the cloud-init configuration in /etc/resolv.conf was overwritten by NetworkManager.

Unfortunately, this change didn't work well for RHOSP overcloud nodes: cloud-init now runs after network.service and, after a reboot, wipes the configuration that network.service wrote to /etc/resolv.conf.

I am not completely sure how to address this issue: cloud-init people could say that we are using a custom procedure to configure overcloud nodes and should change it, so I would like to ask the developers for a second look.

I also found a workaround [1] (basically, it reverts the fix proposed by the cloud-init people). Could you tell me whether it is the correct approach here?

We reproduced this issue on RHOSP 13, but I am not sure whether RHOSP 16.1 is also affected: the same cloud-init version is used there.

[1]
--- cloud-final.service 2020-12-01 13:21:16.828434342 +0000
+++ /etc/systemd/system/cloud-init.target.wants/cloud-final.service     2020-12-01 12:59:02.167153009 +0000
@@ -13,7 +13,7 @@
 KillMode=process
 ExecStartPost=/bin/echo "try restart NetworkManager.service"
 # TODO: try-reload-or-restart is available only on systemd >= 229
-ExecStartPost=/usr/bin/systemctl reload-or-try-restart NetworkManager.service
+#ExecStartPost=/usr/bin/systemctl reload-or-try-restart NetworkManager.service
 
 # Output needs to appear in instance console output
 StandardOutput=journal+console
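
To tell whether a node still carries the NetworkManager restart that the workaround in [1] comments out, one can scan the unit text for an active ExecStartPost= line that restarts or reloads NetworkManager.service. This is a minimal Python sketch, assuming the workaround is applied by commenting out that line as in the diff above; the helper name restarts_networkmanager is hypothetical and not part of cloud-init or tripleo.

```python
import re

# Matches an active ExecStartPost= line whose systemctl verb contains
# "restart" or "reload" and targets NetworkManager.service, e.g.
# "ExecStartPost=/usr/bin/systemctl reload-or-try-restart NetworkManager.service".
# The echo line in the same unit does not match because it has no "systemctl".
_NM_RESTART = re.compile(
    r"^\s*ExecStartPost=.*systemctl\s+\S*(restart|reload)\S*\s+NetworkManager\.service"
)


def restarts_networkmanager(unit_text: str) -> bool:
    """Return True if the unit text still restarts NetworkManager after cloud-final."""
    for line in unit_text.splitlines():
        # systemd unit files treat lines starting with '#' or ';' as comments,
        # so a commented-out ExecStartPost= (the workaround) is ignored here.
        if line.lstrip().startswith(("#", ";")):
            continue
        if _NM_RESTART.search(line):
            return True
    return False
```

Feeding it the output of `systemctl cat cloud-final.service` rather than a single file would also take any drop-in overrides into account.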

Comment 1 David Rosenfeld 2020-12-01 22:06:00 UTC
Moved to Compute because the customer case says that rebooting a compute node causes /etc/resolv.conf to be rewritten.

Comment 5 ldenny 2021-01-03 22:16:44 UTC
Hi Rabi,

Can this change be backported to OSP13? I see this bug is still in ON_DEV status; I just wanted to give the customer an update.

Comment 28 Lon Hohberger 2021-06-17 10:31:04 UTC
According to our records, this should be resolved by openstack-tripleo-common-8.7.1-29.el7ost.  This build is available now.

Comment 29 Leander Koornneef 2021-06-23 14:42:25 UTC
In the past few days we have updated two OSP13 environments from Z12 to Z16 and in both cases all the overcloud nodes (controllers, hypervisors) ended up with an empty /etc/resolv.conf after rebooting.
It appears to be the same issue as described by the original reporter of this BZ. Restarting the network service on each node resolves the issue temporarily, at least until the next reboot. 
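
Confirming the symptom on a node after a reboot comes down to checking whether /etc/resolv.conf still lists any nameserver. This is a minimal Python sketch, assuming standard resolv.conf syntax (`nameserver <address>` lines, with `#` or `;` marking comment lines); the helper names nameservers and resolv_conf_is_empty are hypothetical and not part of any tripleo tooling.

```python
from pathlib import Path
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf text, in order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            servers.append(fields[1])
    return servers


def resolv_conf_is_empty(path: str = "/etc/resolv.conf") -> bool:
    """True if the file has no nameserver entries, i.e. the symptom in this bug."""
    return not nameservers(Path(path).read_text())
```

A file that contains only a "# Generated by ..." header is treated as empty, which matches what the affected nodes show after a reboot.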

The openstack-tripleo-common package on the directors is in fact the same version as mentioned above:

(undercloud) [stack@nlhrl1vim52-dir2 ~]$ rpm -q openstack-tripleo-common
openstack-tripleo-common-8.7.1-29.el7ost.noarch

Comment 30 Rabi Mishra 2021-06-23 14:55:09 UTC
> The openstack-tripleo-common package on the directors is in fact the same version as mentioned above:

For minor updates you would need openstack-tripleo-heat-templates-8.4.1-86.el7ost, which I don't think shipped with z16.

Comment 31 Leander Koornneef 2021-06-23 15:03:57 UTC
Ah, indeed, we have a different version of that package:

(undercloud) [stack@nlhrl1vim52-dir2 ~]$ rpm -q openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-8.4.1-85.el7ost.noarch
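
The exchange above comes down to comparing each installed package with the corresponding entry in the Fixed In Version field. This is a minimal Python sketch of that check, assuming the `rpm -q` output format shown here (name-version-release.arch), a dotted numeric version, and a release string that starts with an integer (as in `85.el7ost`). It is a simplified comparison, not RPM's full version-comparison algorithm, so rpm itself remains authoritative. The helper names parse_nvr and is_at_least are hypothetical.

```python
from typing import Tuple


def parse_nvr(nvra: str) -> Tuple[str, Tuple[int, ...], int]:
    """Split 'name-version-release[.arch]' into (name, version tuple, release number).

    Simplified: assumes a dotted numeric version such as 8.4.1 and a release
    that starts with an integer such as 86.el7ost.
    """
    name, version, release = nvra.rsplit("-", 2)
    version_tuple = tuple(int(part) for part in version.split("."))
    release_number = int(release.split(".", 1)[0])
    return name, version_tuple, release_number


def is_at_least(installed: str, fixed: str) -> bool:
    """True if `installed` is the same package and not older than `fixed`."""
    i_name, i_ver, i_rel = parse_nvr(installed)
    f_name, f_ver, f_rel = parse_nvr(fixed)
    if i_name != f_name:
        raise ValueError(f"package mismatch: {i_name} vs {f_name}")
    return (i_ver, i_rel) >= (f_ver, f_rel)
```

With the outputs from comments 29 and 31, the openstack-tripleo-common check passes (8.7.1-29 against 8.7.1-28). The openstack-tripleo-heat-templates check fails (8.4.1-85 against 8.4.1-86), which is consistent with the explanation in comment 30.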

Comment 36 Jad Haj Yahya 2021-07-18 15:20:12 UTC
Ran job http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-13-from-z14-HA-ipv4/, which performs a minor update to the latest z-stream and also reboots the overcloud.

Verified that /etc/resolv.conf was not changed.

Comment 40 errata-xmlrpc 2021-08-02 13:36:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13 Bug Fix and Enhancement Advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2978

Comment 44 xiachen 2021-09-02 01:37:42 UTC
*** Bug 1933202 has been marked as a duplicate of this bug. ***

Comment 47 Red Hat Bugzilla 2023-09-15 01:31:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days