1324160 – Overcloud nodes have an empty /etc/resolv.conf post upgrade

Summary: Overcloud nodes have an empty /etc/resolv.conf post upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rhosp-director
Sub Component:
Version:	8.0 (Liberty)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ga
Target Release:	8.0 (Liberty)
Assignee:	Giulio Fidente
QA Contact:	Marius Cornea
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-04-05 17:23 UTC by Marius Cornea
Modified:	2016-04-15 14:32 UTC
CC List:	8 users
Fixed In Version:	openstack-tripleo-heat-templates-0.8.14-7.el7ost os-net-config-0.2.3-2.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-04-15 14:32:08 UTC
Target Upstream Version:
Embargoed:

Attachments
updated os-net-config rpm with the change from https://review.openstack.org/#/c/302352/4 (143.95 KB, application/x-rpm) 2016-04-07 12:20 UTC, Marios Andreou

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1567004	None	None	None	2016-04-08 06:08:30 UTC
OpenStack gerrit	302352	'None'	MERGED	Use PEERDNS when no dns_servers or use_dhcp is provided	2021-01-29 21:31:31 UTC
OpenStack gerrit	302769	'None'	MERGED	Add removal of the /etc/resolv.conf.save file for +bug/1567004	2021-01-29 21:32:14 UTC
Red Hat Product Errata	RHBA-2016:0637	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 8 director release candidate Bug Fix Advisory	2016-04-15 18:28:05 UTC

Description Marius Cornea 2016-04-05 17:23:53 UTC

Description of problem:
Overcloud nodes have an empty /etc/resolv.conf after upgrade

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-5.el7ost.noarch

How reproducible:


Steps to Reproduce:

1. Deploy overcloud with OSPd 7.3

export THT=~/templates/my-overcloud-7.3
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-7.3-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Upgrade undercloud
yum update -y
openstack undercloud upgrade

3. Upgrade step 1

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker-init.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

4. Upgrade step 3
export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

5. Upgrade step 4
upgrade-non-controller.sh --upgrade overcloud-novacompute-0

6. Upgrade step 5
upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1

7. Upgrade step 6
export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker-converge.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu


Actual results:
The /etc/resolv.conf on the overcloud nodes is empty:

[root@overcloud-controller-1 ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search localdomain


# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com


Expected results:
The resolv.conf is populated according to the ifcfg scripts which contain the DNS servers:

[root@overcloud-controller-1 ~]# grep -R ^DNS /etc/sysconfig/network-scripts/
/etc/sysconfig/network-scripts/ifcfg-br-ex:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-ex:DNS2=10.11.5.19
/etc/sysconfig/network-scripts/ifcfg-br-infra:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-infra:DNS2=10.11.5.19
/etc/sysconfig/network-scripts/ifcfg-br-storage:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-storage:DNS2=10.11.5.19

Additional info:

/var/log/messages shows a run of the ifdown-post script that updated the resolv.conf

[root@overcloud-controller-1 ~]# grep resolv.conf /var/log/messages 
Apr  5 07:52:50 localhost NET[9295]: /etc/sysconfig/network-scripts/ifup-post : updated /etc/resolv.conf
Apr  5 13:35:14 overcloud-controller-1 NET[5151]: /etc/sysconfig/network-scripts/ifdown-post : updated /etc/resolv.conf

If we correlate the time with the os-collect-config log we can see that the vlan interfaces go down around 13:35:14:

[root@overcloud-controller-1 ~]# journalctl -l -u os-collect-config | grep ifdown | grep 13:35
Apr 05 13:35:14 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:14 PM] [INFO] running ifdown on interface: vlan200
Apr 05 13:35:14 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:14 PM] [INFO] running ifdown on interface: vlan300
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifdown on interface: vlan100
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifdown on interface: vlan301

and are brought back up:

Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifup on interface: vlan200
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifup on interface: vlan300
Apr 05 13:35:16 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:16 PM] [INFO] running ifup on interface: vlan100
Apr 05 13:35:16 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:16 PM] [INFO] running ifup on interface: vlan301

Now I suspect that when the ifdown-post script is run the resolv.conf gets updated and the nameservers get removed. When the ifup scripts are run for the vlan interfaces no nameservers get added to the resolv.conf because the ifcfg-vlan* scripts don't contain any DNS entries. 

I'm not sure why the ifdown/ifup is called for the vlan interfaces.
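The suspicion above can be checked by listing which ifcfg files actually carry DNS entries. This is a minimal sketch, assuming the files use plain KEY=value lines; the helper name `dns_entries` and the sample file contents are illustrative and not part of initscripts or os-net-config:

```python
import re

def dns_entries(ifcfg_text):
    """Return the DNS1/DNS2/... values found in the text of one ifcfg file."""
    servers = []
    for line in ifcfg_text.splitlines():
        # Match uncommented lines such as DNS1=10.16.36.29
        match = re.match(r"^DNS\d+=(.+)$", line.strip())
        if match:
            servers.append(match.group(1).strip('"'))
    return servers

# Sample contents modelled on the files shown in this report (illustrative).
br_ex = "DEVICE=br-ex\nDNS1=10.16.36.29\nDNS2=10.11.5.19\n"
vlan200 = "DEVICE=vlan200\nONBOOT=yes\nIPV6INIT=yes\n"

for name, text in [("ifcfg-br-ex", br_ex), ("ifcfg-vlan200", vlan200)]:
    print(name, dns_entries(text) or "no DNS entries")
```

If the vlan files come back with no entries, restarting only those interfaces gives the network scripts nothing to write back into resolv.conf, which would match the behaviour described.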

Comment 3 Brad P. Crochet 2016-04-06 01:47:26 UTC

So far, I have tested this without IPv6, both without SSL and with SSL. Neither run reproduced the error. So, if there is indeed an issue, it lies in the IPv6 pathway. I do not have an IPv6 setup at the moment so I can't test it directly. I will continue to investigate possible causes.

Comment 4 Marius Cornea 2016-04-06 08:23:33 UTC

I did a comparison between the ifcfg-vlan scripts between a 7.3 deployment and the upgraded one and there seems to be a change that might generate the restart:

## 7.3 fresh deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13


## Upgraded deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13/64

Note that the upgraded deployment contains the subnet mask ( /64 ) in the IPV6ADDR.

Comment 5 Marios Andreou 2016-04-06 09:08:25 UTC

(In reply to Marius Cornea from comment #4)
> I did a comparison between the ifcfg-vlan scripts between a 7.3 deployment
> and the upgraded one and there seems to be a change that might generate the
> restart:
> 
> ## 7.3 fresh deployment
> [root@overcloud-controller-0 heat-admin]# cat
> /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13
> 

Still poking, update below for anyone else debugging.

I just deployed a 7.3 env like this:

openstack overcloud deploy --templates \
--control-scale 3 \
--compute-scale 1 \
--libvirt-type qemu \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
-e network_env.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/enable-tls.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/inject-trust-anchor.yaml \
--ntp-server "0.fedora.pool.ntp.org"

I can confirm that my /etc/sysconfig/network-scripts/ifcfg-vlan20 on a compute node is like above, i.e. without the netmask. IPV6ADDR=fd00:fd00:fd00:2000::12

However, looking at the os-net-config data at /etc/os-net-config/config.json, the netmask *is* specified [1]. I was initially trying to determine if there was a difference in the way the v6 address was specified in the 7.3 vs stable liberty patches, but they seem the same. So I am trying to determine if something changed in os-net-config which made the way the ifcfg files are written change to now include the netmask (perhaps it was ignored before).

thanks, marios


[1] [root@overcloud-compute-0 os-net-config]# cat /etc/os-net-config/config.json
{"network_config": [{"dns_servers": [], "name": "br-ex", "members": [{"type": "interface", "name": "nic1", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:2000::12/64"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:3000::11/64"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.4/24"}], "vlan_id": 50}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "ovs_bridge", "addresses": [{"ip_netmask": "192.0.2.8/24"}]}]}
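A minimal sketch for pulling the relevant fields out of that file, to make it easier to see that the addresses already carry a prefix and that `dns_servers` is empty. The JSON below is a trimmed copy of the config above (one VLAN member kept); the function names `summarize` and `_row` are illustrative only:

```python
import json

# Trimmed copy of the config.json from this comment (one VLAN member kept).
CONFIG_TEXT = """
{"network_config": [{"dns_servers": [], "name": "br-ex", "type": "ovs_bridge",
  "use_dhcp": false,
  "members": [{"type": "interface", "name": "nic1", "primary": true},
              {"type": "vlan", "vlan_id": 20,
               "addresses": [{"ip_netmask": "fd00:fd00:fd00:2000::12/64"}]}],
  "addresses": [{"ip_netmask": "192.0.2.8/24"}]}]}
"""

def _row(name, obj):
    """Build one (name, dns_servers, ip_netmask list) tuple."""
    addresses = [a["ip_netmask"] for a in obj.get("addresses", [])]
    return (name, obj.get("dns_servers", []), addresses)

def summarize(config):
    """Return one row per top-level object and per VLAN member."""
    rows = []
    for obj in config["network_config"]:
        rows.append(_row(obj["name"], obj))
        for member in obj.get("members", []):
            if member["type"] == "vlan":
                rows.append(_row("vlan%d" % member["vlan_id"], member))
    return rows

for row in summarize(json.loads(CONFIG_TEXT)):
    print(row)
```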




> 
> ## Upgraded deployment
> [root@overcloud-controller-0 heat-admin]# cat
> /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13/64
> 
> Note that the upgraded deployment contains the subnet mask ( /64 ) in the
> IPV6ADDR.

Comment 6 Marios Andreou 2016-04-06 09:55:58 UTC

Thanks to gfidente: it looks like this commit in os-net-config changes the way the ifcfg files are created so that they include the netmask: https://github.com/openstack/os-net-config/commit/0b130b6b3b4a9e0768e99b1496d2852f2ca47bb7

I also confirmed on my compute node that /usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py contains the line: data += "IPV6ADDR=%s\n" % first_v6.ip
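The two renderings of the same address can be reproduced with Python's standard `ipaddress` module. This is only an illustration of address-only versus address-with-prefix output, not os-net-config's own code; the function name `render_ipv6addr` is hypothetical:

```python
import ipaddress

def render_ipv6addr(ip_netmask, with_prefix):
    """Render an IPV6ADDR line from an 'address/prefix' string."""
    iface = ipaddress.ip_interface(ip_netmask)
    # .ip drops the prefix length, .with_prefixlen keeps it
    value = iface.with_prefixlen if with_prefix else str(iface.ip)
    return "IPV6ADDR=%s" % value

old_line = render_ipv6addr("fd00:fd00:fd00:2000::13/64", with_prefix=False)
new_line = render_ipv6addr("fd00:fd00:fd00:2000::13/64", with_prefix=True)
print(old_line)
print(new_line)
print("file content changed:", old_line != new_line)
```

The input data is identical in both cases; only the rendering differs, which is enough to make the generated ifcfg file differ from the one on disk.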

Comment 7 Marios Andreou 2016-04-06 10:46:45 UTC

(thanks jistr and gfidente) the workaround for now is to explicitly make the NetworkDeployment not happen at all during upgrade. We have a NetworkDeploymentActions parameter which gets mapped to the 'actions' property of the corresponding heat StructuredDeployment http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::StructuredDeployment-prop-actions

We can try to set 'NetworkDeploymentActions: []' in the parameter_defaults section of the upgrades environment files: major-upgrade-pacemaker-converge.yaml major-upgrade-pacemaker.yaml major-upgrade-pacemaker-init.yaml

I am not sure yet we can get away with '[]' because "Allowed values: CREATE, UPDATE, DELETE, SUSPEND, RESUME" in that heat doc ^^^ so we may need to explicitly set to something else like 'SUSPEND'. :/

Comment 8 Marius Cornea 2016-04-06 12:06:57 UTC

I tried adding NetworkDeploymentActions: [] to the parameter_defaults of the major-upgrade-pacemaker* environments but at upgrade step 3 the network settings got reapplied and the resolv.conf went empty.

Comment 9 Marios Andreou 2016-04-06 12:22:52 UTC

thanks for testing that mcornea 

After more discussion with shardy and others on #tripleo we don't think it is heat after all that is triggering the network config to be re-applied. It's looking like os-net-config gets updated and that triggers re-application of the config; it is the same config (see comment 5 for  /etc/os-net-config/config.json ) but now os-net-config includes the netmask when writing the ifcfg files as pointed out in comment 6
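A rough sketch of why an unchanged config.json can still lead to interface restarts, assuming a tool that restarts every interface whose freshly rendered ifcfg text differs from what is on disk. The function name and the sample file contents are illustrative, not taken from os-net-config:

```python
def interfaces_to_restart(on_disk, rendered):
    """Return names whose rendered ifcfg text differs from the on-disk text."""
    return sorted(
        name for name, text in rendered.items()
        if on_disk.get(name) != text
    )

# Files as written by the old package (no prefix in IPV6ADDR).
on_disk = {
    "vlan200": "DEVICE=vlan200\nIPV6ADDR=fd00:fd00:fd00:2000::13\n",
    "br-ex": "DEVICE=br-ex\nDNS1=10.16.36.29\n",
}
# Same input data rendered by the updated package (prefix included).
rendered = {
    "vlan200": "DEVICE=vlan200\nIPV6ADDR=fd00:fd00:fd00:2000::13/64\n",
    "br-ex": "DEVICE=br-ex\nDNS1=10.16.36.29\n",
}
print(interfaces_to_restart(on_disk, rendered))
```

Under that assumption only the vlan interface is restarted, and that is the interface whose file has no DNS entries.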

Comment 10 Steven Hardy 2016-04-06 12:24:30 UTC

I did some investigation and I don't think NetworkDeploymentActions helps here, because it's working as designed:

- If you leave it at the default of ['CREATE'] the deployment will never be reapplied by heat, even if the input_values change.

- If the input_values are unchanged, we don't even attempt to update the NetworkDeployment on update (it'll remain at CREATE_COMPLETE)

- If any input_values change, it'll move to UPDATE_COMPLETE (arguably this is a bug), but we actually don't do anything, we exit before performing any update because UPDATE isn't in DEPLOY_ACTIONS:

https://github.com/openstack/heat/blob/master/heat/engine/resources/openstack/heat/software_deployment.py#L259

I tested this and can confirm it works as expected. However, I think that because os-net-config is applied directly via an o-r-c (os-refresh-config) script (not a heat-config hook), it may get reapplied every time *any* change to the o-r-c data happens, i.e. it's not properly under the control of the SoftwareDeployment:

https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-net-config/os-refresh-config/configure.d/20-os-net-config

This is one reason I'm trying to move away from group: os-apply-config as all such config suffer from this issue: https://review.openstack.org/#/c/271450/

That said, if the config hasn't changed, I don't think re-running os-net-config should do anything, and if it does it's probably a bug in os-net-config itself.

Comment 11 Giulio Fidente 2016-04-06 17:09:47 UTC

The change at https://review.openstack.org/302352 will prevent the ifup/ifdown scripts from emptying the resolv.conf when restarting interfaces which don't have DNS1/DNS2 entries.

We might still suffer issues caused by unwanted interface restarts; should that be a problem, an alternative approach is at https://review.openstack.org/#/c/302337/
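A rough sketch of the rule that the title of change 302352 describes ("Use PEERDNS when no dns_servers or use_dhcp is provided"), assuming PEERDNS=no is written only when an interface has neither DNS servers nor DHCP. The function name and signature are hypothetical and the real patch may differ:

```python
def dns_lines(dns_servers, use_dhcp):
    """Return the DNS-related ifcfg lines for one interface."""
    lines = []
    if not dns_servers and not use_dhcp:
        # Nothing to contribute: ask the network scripts to leave
        # resolv.conf alone when this interface goes up or down.
        lines.append("PEERDNS=no")
    # This sketch writes at most two entries, matching the DNS1/DNS2
    # keys seen in this report.
    for index, server in enumerate(dns_servers[:2], start=1):
        lines.append("DNS%d=%s" % (index, server))
    return lines

print(dns_lines([], use_dhcp=False))
print(dns_lines(["10.16.36.29", "10.11.5.19"], use_dhcp=False))
print(dns_lines([], use_dhcp=True))
```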

Comment 12 Marios Andreou 2016-04-07 12:20:30 UTC

Created attachment 1144694
updated os-net-config rpm with the change from https://review.openstack.org/#/c/302352/4

Comment 13 Marios Andreou 2016-04-07 12:22:52 UTC

I patched the os-net-config rpm with the change at https://review.openstack.org/#/c/302352/4 (attached). By itself, the change won't fix the issue we are seeing here. Setting PEERDNS=no will help for future changes to the ifcfg-vlanXX files. However, for the upgrade we need to delete the /etc/resolv.conf.save file before updating the os-net-config package, so that it simply *cannot* be restored to /etc/resolv.conf (and so /etc/resolv.conf is not overwritten).

I am trying this as a workaround for now; we could add it to the UpgradeInitCommand...
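A minimal sketch of the removal step, assuming it would run as root on each node before the os-net-config package is updated. The helper name `remove_resolv_backup` is illustrative, and the demonstration uses a scratch directory rather than the real /etc:

```python
import tempfile
from pathlib import Path

def remove_resolv_backup(path):
    """Delete the saved resolv.conf copy if present; return True if removed."""
    backup = Path(path)
    if backup.exists():
        backup.unlink()
        return True
    return False

# Demonstration against a scratch directory rather than the real /etc.
with tempfile.TemporaryDirectory() as scratch:
    backup = Path(scratch) / "resolv.conf.save"
    backup.write_text("# Generated by NetworkManager\nsearch localdomain\n")
    first = remove_resolv_backup(backup)    # file exists, gets removed
    second = remove_resolv_backup(backup)   # nothing left to remove
    print(first, second)
```

On a real node the argument would be /etc/resolv.conf.save; with that file gone there is no saved copy left to restore over /etc/resolv.conf.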

Comment 14 Marios Andreou 2016-04-07 15:35:10 UTC

I tested the fix at https://review.openstack.org/#/c/302769/3 (copy-pasting my comment from there; it would be great to have someone else verify too, please):



so FWIW

I tested this on a 3/1 (3 controllers, 1 compute) v6 environment deployed like

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml -e network_env.yaml --ntp-server "0.fedora.pool.ntp.org"


On all nodes bar controller-2 I manually installed the updated version of os-net-config that includes gfidente's fix from https://review.openstack.org/#/c/302352/4 (that rpm is attached to the bugzilla, follow the gerrit bug)

So on controller-2 we only removed the /etc/resolv.conf.save file.  

I completed the init successfully, with this change applied like

openstack overcloud deploy --templates -e  /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml -e network_env.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e rhos-release-8.yaml


Once completed I then upgraded controllers (step 3 is where this was reported yesterday) and it finished OK. The nodes have retained their resolv.conf fine:

[stack@instack ~]$  for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i 'hostname; sudo grep nameserver /etc/resolv.conf'; done
overcloud-controller-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-1.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-2.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-compute-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1

Comment 16 Marios Andreou 2016-04-08 07:35:40 UTC

info for anyone looking to test the removal of the /etc/resolv.conf.save file - since the change at  https://review.openstack.org/#/c/302769/ is not yet in stable/liberty you can include the change in your environment before starting the upgrade:


sudo su
pushd /usr/share/openstack-tripleo-heat-templates

# replace with the file from the review:
curl "https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=blob_plain;f=extraconfig/tasks/major_upgrade_pacemaker_init.yaml;hb=706c2fe4b62f95ac13ee800fc08e549180afc810" > extraconfig/tasks/major_upgrade_pacemaker_init.yaml
# sanity check
cat extraconfig/tasks/major_upgrade_pacemaker_init.yaml

popd
exit

Comment 17 Mike Burns 2016-04-08 10:45:38 UTC

(In reply to marios from comment #16)
> info for anyone looking to test the removal of the /etc/resolv.conf.save
> file - since the change at  https://review.openstack.org/#/c/302769/ is not
> yet in stable/liberty you can include the change in your environment before
> starting the upgrade:
> 

This change is in the latest tht build.   It was manually backported once it landed on master.

Comment 19 errata-xmlrpc 2016-04-15 14:32:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0637.html
