Description of problem:
Overcloud nodes have an empty /etc/resolv.conf after upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-5.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy overcloud with OSPd 7.3

export THT=~/templates/my-overcloud-7.3
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-7.3-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Upgrade the undercloud

yum update -y
openstack undercloud upgrade

3. Upgrade step 1

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker-init.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

4. Upgrade step 3

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

5. Upgrade step 4

upgrade-non-controller.sh --upgrade overcloud-novacompute-0

6. Upgrade step 5

upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1

7. Upgrade step 6

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e $THT/environments/major-upgrade-pacemaker-converge.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

Actual results:
The /etc/resolv.conf on the overcloud nodes is empty:

[root@overcloud-controller-1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search localdomain

# No nameservers found; try putting DNS servers into your
# ifcfg files in /etc/sysconfig/network-scripts like so:
#
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com

Expected results:
resolv.conf is populated according to the ifcfg scripts, which contain the DNS servers:

[root@overcloud-controller-1 ~]# grep -R ^DNS /etc/sysconfig/network-scripts/
/etc/sysconfig/network-scripts/ifcfg-br-ex:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-ex:DNS2=10.11.5.19
/etc/sysconfig/network-scripts/ifcfg-br-infra:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-infra:DNS2=10.11.5.19
/etc/sysconfig/network-scripts/ifcfg-br-storage:DNS1=10.16.36.29
/etc/sysconfig/network-scripts/ifcfg-br-storage:DNS2=10.11.5.19

Additional info:
/var/log/messages shows a run of the ifdown-post script that updated resolv.conf:

[root@overcloud-controller-1 ~]# grep resolv.conf /var/log/messages
Apr 5 07:52:50 localhost NET[9295]: /etc/sysconfig/network-scripts/ifup-post : updated /etc/resolv.conf
Apr 5 13:35:14 overcloud-controller-1 NET[5151]: /etc/sysconfig/network-scripts/ifdown-post : updated /etc/resolv.conf

If we correlate the time with the os-collect-config log, we can see that the vlan interfaces go down around 13:35:14:

[root@overcloud-controller-1 ~]# journalctl -l -u os-collect-config | grep ifdown | grep 13:35
Apr 05 13:35:14 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:14 PM] [INFO] running ifdown on interface: vlan200
Apr 05 13:35:14 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:14 PM] [INFO] running ifdown on interface: vlan300
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifdown on interface: vlan100
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifdown on interface: vlan301

and are brought back up:

Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifup on interface: vlan200
Apr 05 13:35:15 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:15 PM] [INFO] running ifup on interface: vlan300
Apr 05 13:35:16 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:16 PM] [INFO] running ifup on interface: vlan100
Apr 05 13:35:16 overcloud-controller-1.localdomain os-collect-config[3630]: [2016/04/05 01:35:16 PM] [INFO] running ifup on interface: vlan301

I suspect that when the ifdown-post script runs, resolv.conf gets updated and the nameservers are removed. When the ifup scripts then run for the vlan interfaces, no nameservers are added back to resolv.conf because the ifcfg-vlan* files don't contain any DNS entries. I'm not sure why ifdown/ifup is called for the vlan interfaces.
So far I have tested this both without IPv6 and SSL, and with SSL; neither run reproduced the error. So if there is indeed an issue, it lies in the IPv6 path. I do not have an IPv6 setup at the moment, so I can't test it directly. I will continue to investigate possible causes.
I did a comparison of the ifcfg-vlan scripts between a 7.3 deployment and the upgraded one, and there is a change that might trigger the restart:

## 7.3 fresh deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13

## Upgraded deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13/64

Note that the upgraded deployment contains the subnet mask (/64) in the IPV6ADDR.
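For anyone comparing their own nodes, the delta reduces to a one-line diff. A quick simulation (files abbreviated to the differing line plus one context line; the temp-dir paths are made up for the sketch):

```shell
tmp=$(mktemp -d)

# Abbreviated copies of the two ifcfg-vlan200 files shown above:
printf 'DEVICE=vlan200\nIPV6ADDR=fd00:fd00:fd00:2000::13\n' > "$tmp/ifcfg-vlan200.7.3"
printf 'DEVICE=vlan200\nIPV6ADDR=fd00:fd00:fd00:2000::13/64\n' > "$tmp/ifcfg-vlan200.upgraded"

# diff exits non-zero when the files differ, so mask that with || true:
delta=$(diff "$tmp/ifcfg-vlan200.7.3" "$tmp/ifcfg-vlan200.upgraded" || true)
echo "$delta"

rm -rf "$tmp"
```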
(In reply to Marius Cornea from comment #4)
> I did a comparison between the ifcfg-vlan scripts between a 7.3 deployment
> and the upgraded one and there seems to be a change that might generate the
> restart:
>
> ## 7.3 fresh deployment
> [root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13

Still poking; update below for anyone else debugging.

I just deployed a 7.3 env like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
-e network_env.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/enable-tls.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/inject-trust-anchor.yaml \
--ntp-server "0.fedora.pool.ntp.org"

I can confirm that my /etc/sysconfig/network-scripts/ifcfg-vlan20 on a compute node is like the above, i.e. without the netmask:

IPV6ADDR=fd00:fd00:fd00:2000::12

However, looking at the os-net-config data at /etc/os-net-config/config.json, the netmask *is* specified [1]. I was initially trying to determine whether there was a difference in the way the v6 address was specified in the 7.3 vs stable/liberty patches, but they seem the same.
So I am trying to determine if something changed in os-net-config which made the way the ifcfg files are written change, to now include the netmask (perhaps it was ignored before), thanks, marios

[1]
[root@overcloud-compute-0 os-net-config]# cat /etc/os-net-config/config.json
{
  "network_config": [
    {
      "dns_servers": [],
      "name": "br-ex",
      "members": [
        {"type": "interface", "name": "nic1", "primary": true},
        {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:2000::12/64"}], "vlan_id": 20},
        {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:3000::11/64"}], "vlan_id": 30},
        {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.4/24"}], "vlan_id": 50}
      ],
      "routes": [
        {"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"},
        {"default": true, "next_hop": "192.0.2.1"}
      ],
      "use_dhcp": false,
      "type": "ovs_bridge",
      "addresses": [{"ip_netmask": "192.0.2.8/24"}]
    }
  ]
}

> ## Upgraded deployment
> [root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13/64
>
> Note that the upgraded deployment contains the subnet mask ( /64 ) in the
> IPV6ADDR.
Thanks to gfidente: it looks like this commit in os-net-config changed the way the ifcfg files are created, to include the netmask:

https://github.com/openstack/os-net-config/commit/0b130b6b3b4a9e0768e99b1496d2852f2ca47bb7

I also confirmed on my compute node that /usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py looks like:

data += "IPV6ADDR=%s\n" % first_v6.ip
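The behavioural difference is just string formatting of the address object. A minimal sketch using the stdlib ipaddress module (os-net-config itself uses netaddr objects, so the attribute names differ; this only illustrates the .ip-vs-full-CIDR distinction):

```python
import ipaddress

# The address as it appears in /etc/os-net-config/config.json:
addr = ipaddress.ip_interface("fd00:fd00:fd00:2000::13/64")

# Old behaviour: only the address part is written; the netmask is dropped.
old_line = "IPV6ADDR=%s" % addr.ip
# New behaviour (after the commit above): the prefix length is kept.
new_line = "IPV6ADDR=%s" % addr.with_prefixlen

print(old_line)  # IPV6ADDR=fd00:fd00:fd00:2000::13
print(new_line)  # IPV6ADDR=fd00:fd00:fd00:2000::13/64
```

Since the regenerated ifcfg content no longer matches what is on disk from the 7.3 deployment, this one-line difference would plausibly be what makes os-net-config bounce the vlan interfaces on upgrade.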
(thanks jistr and gfidente) The workaround for now is to explicitly make the NetworkDeployment not happen at all during the upgrade. We have a NetworkDeploymentActions parameter which gets mapped to the 'actions' property of the corresponding heat StructuredDeployment:

http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::StructuredDeployment-prop-actions

We can try to set 'NetworkDeploymentActions: []' in the parameter_defaults section of the upgrade environment files:

major-upgrade-pacemaker-init.yaml
major-upgrade-pacemaker.yaml
major-upgrade-pacemaker-converge.yaml

I am not sure yet that we can get away with '[]', because that heat doc says "Allowed values: CREATE, UPDATE, DELETE, SUSPEND, RESUME", so we may need to explicitly set it to something else like 'SUSPEND'. :/
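For reference, the entry we would try in each of those environment files would look something like this (a sketch; whether heat accepts '[]' for the actions property is exactly the open question above):

```
parameter_defaults:
  NetworkDeploymentActions: []
```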
I tried adding NetworkDeploymentActions: [] to the parameter_defaults of the major-upgrade-pacemaker* environments but at upgrade step 3 the network settings got reapplied and the resolv.conf went empty.
Thanks for testing that, mcornea. After more discussion with shardy and others on #tripleo, we don't think it is heat after all that is triggering the network config to be re-applied. It looks like os-net-config gets updated and that triggers re-application of the config; it is the same config (see comment 5 for /etc/os-net-config/config.json), but os-net-config now includes the netmask when writing the ifcfg files, as pointed out in comment 6.
I did some investigation and I don't think NetworkDeploymentActions helps here, because it's working as designed:

- If you leave it at the default of ['CREATE'], the deployment will never be reapplied by heat, even if the input_values change.
- If the input_values are unchanged, we don't even attempt to update the NetworkDeployment on update (it will remain at CREATE_COMPLETE).
- If any input_values change, it will move to UPDATE_COMPLETE (arguably this is a bug), but we don't actually do anything; we exit before performing any update because UPDATE isn't in DEPLOY_ACTIONS: https://github.com/openstack/heat/blob/master/heat/engine/resources/openstack/heat/software_deployment.py#L259

I tested this and can confirm it works as expected. However, because os-net-config is applied directly via an o-r-c script (not a heat-config hook), it may get reapplied every time *any* change to the o-r-c data happens, i.e. it's not properly under the control of the SoftwareDeployment:

https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-net-config/os-refresh-config/configure.d/20-os-net-config

This is one reason I'm trying to move away from group: os-apply-config, as all such config suffers from this issue:

https://review.openstack.org/#/c/271450/

That said, if the config hasn't changed, re-running os-net-config shouldn't do anything, and if it does, that's probably a bug in os-net-config itself.
The change at https://review.openstack.org/302352 will prevent the ifup/ifdown scripts from emptying resolv.conf when restarting interfaces which don't have DNS1/DNS2 set.

We might still suffer issues caused by unwanted interface restarts; should that be a problem, an alternative approach is at https://review.openstack.org/#/c/302337/
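A sketch of the intended effect of the 302352 change on an interface with no DNS entries, taking the vlan200 file from comment 4 (the exact placement of the new line in the generated file is an assumption):

```
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13/64
```

With PEERDNS=no, the initscripts' ifup-post/ifdown-post skip their resolv.conf handling for this interface, so bouncing it can no longer empty the file.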
Created attachment 1144694 [details]
updated os-net-config rpm with the change from https://review.openstack.org/#/c/302352/4
I patched the os-net-config rpm with the change at https://review.openstack.org/#/c/302352/4 (attached). By itself, the change won't fix the issue we are seeing here. Setting PEERDNS=no will help for future changes to the ifcfg-vlanXX files; however, for the upgrade we also need to delete the /etc/resolv.conf.save file before updating the os-net-config package, so that it simply *cannot* be restored to /etc/resolv.conf (and so resolv.conf is not overwritten). I'm trying this as the workaround for now - we could add it to the UpgradeInitCommand...
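To make the ordering concrete, here is a small simulation of the save/restore behaviour described above, run against a temp dir rather than the real /etc (the restore step stands in for what ifdown-post does with resolv.conf.save, per the comments above):

```shell
# Paths under "$tmp" stand in for the real files under /etc.
tmp=$(mktemp -d)

# The node's good resolv.conf:
echo "nameserver 10.16.36.29" > "$tmp/resolv.conf"

# ifup-post previously saved an (empty) copy aside as resolv.conf.save:
: > "$tmp/resolv.conf.save"

# Workaround: delete the saved copy before os-net-config bounces the
# interfaces, so there is nothing stale left to restore over resolv.conf.
rm -f "$tmp/resolv.conf.save"

# The ifdown-post restore step now becomes a no-op:
if [ -f "$tmp/resolv.conf.save" ]; then
    cp "$tmp/resolv.conf.save" "$tmp/resolv.conf"
fi

result=$(grep nameserver "$tmp/resolv.conf")
echo "$result"
rm -rf "$tmp"
```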
I tested the fix at https://review.openstack.org/#/c/302769/3 (copy-pasting my comment from there; it would be great to have someone else verify too, please):

So FWIW I tested this on a 3/1 v6 environment deployed like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
-e network_env.yaml --ntp-server "0.fedora.pool.ntp.org"

On all nodes bar controller-2 I manually installed the updated version of os-net-config that includes gfidente's fix from https://review.openstack.org/#/c/302352/4 (that rpm is attached to the bugzilla; follow the gerrit link in the bug). So on controller-2 we only removed the /etc/resolv.conf.save file.

I completed the init successfully, with this change applied, like:

openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
-e network_env.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
-e rhos-release-8.yaml

Once that completed, I then upgraded the controllers (step 3 is where this was reported yesterday) and it finished OK.
The nodes have retained their resolv.conf fine:

[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i 'hostname; sudo grep nameserver /etc/resolv.conf'; done
overcloud-controller-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-1.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-2.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-compute-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
Info for anyone looking to test the removal of the /etc/resolv.conf.save file: since the change at https://review.openstack.org/#/c/302769/ is not yet in stable/liberty, you can include the change in your environment before starting the upgrade:

sudo su
pushd /usr/share/openstack-tripleo-heat-templates
# replace with the file from the review:
curl "https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=blob_plain;f=extraconfig/tasks/major_upgrade_pacemaker_init.yaml;hb=706c2fe4b62f95ac13ee800fc08e549180afc810" > extraconfig/tasks/major_upgrade_pacemaker_init.yaml
# sanity check
cat extraconfig/tasks/major_upgrade_pacemaker_init.yaml
popd
exit
(In reply to marios from comment #16)
> info for anyone looking to test the removal of the /etc/resolv.conf.save
> file - since the change at https://review.openstack.org/#/c/302769/ is not
> yet in stable/liberty you can include the change in your environment before
> starting the upgrade:

This change is in the latest tht build. It was manually backported once it landed on master.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0637.html