Bug 1324160
| Summary: | Overcloud nodes have an empty /etc/resolv.conf post upgrade | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | rhosp-director | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
| Severity: | high | Priority: | unspecified |
| Version: | 8.0 (Liberty) | CC: | brad, dbecker, gfidente, kbasil, mandreou, mburns, morazi, rhel-osp-director-maint |
| Target Milestone: | ga | Target Release: | 8.0 (Liberty) |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | openstack-tripleo-heat-templates-0.8.14-7.el7ost os-net-config-0.2.3-2.el7ost | Doc Type: | Bug Fix |
| Last Closed: | 2016-04-15 14:32:08 UTC | Type: | Bug |

Description
Marius Cornea
2016-04-05 17:23:53 UTC
So far, I have tested this without IPv6 and SSL, and with SSL. Neither run reproduced the error. So, if there is indeed an issue, it lies in the IPv6 pathway. I do not have an IPv6 setup at the moment so I can't test it directly. I will continue to investigate possible causes.

I did a comparison between the ifcfg-vlan scripts between a 7.3 deployment and the upgraded one and there seems to be a change that might generate the restart:

## 7.3 fresh deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13

## Upgraded deployment
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13/64

Note that the upgraded deployment contains the subnet mask ( /64 ) in the IPV6ADDR.

(In reply to Marius Cornea from comment #4)
> I did a comparison between the ifcfg-vlan scripts between a 7.3 deployment
> and the upgraded one and there seems to be a change that might generate the
> restart:
>
> ## 7.3 fresh deployment
> [root@overcloud-controller-0 heat-admin]# cat
> /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13
>

Still poking, update below for anyone else debugging.
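
For anyone repeating the ifcfg comparison on their own nodes, a minimal Python sketch of the same check follows. It assumes the two files have already been copied off the node as text; `parse_ifcfg` and `changed_keys` are hypothetical helper names and are not part of os-net-config.

```python
def parse_ifcfg(text):
    """Parse KEY=value lines from an ifcfg file, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as "tag=200" stay intact.
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


def changed_keys(old_text, new_text):
    """Return {key: (old, new)} for every key whose value differs."""
    old, new = parse_ifcfg(old_text), parse_ifcfg(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# File contents as pasted in this bug (7.3 fresh deployment).
FRESH_7_3 = """\
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
IPV6INIT=yes
IPV6_AUTOCONF=no
IPV6ADDR=fd00:fd00:fd00:2000::13
"""

# The upgraded file differs only in the /64 suffix.
UPGRADED = FRESH_7_3.replace(
    "IPV6ADDR=fd00:fd00:fd00:2000::13", "IPV6ADDR=fd00:fd00:fd00:2000::13/64"
)

print(changed_keys(FRESH_7_3, UPGRADED))
```

With the two files from this report, the only key reported is IPV6ADDR.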
I just deployed a 7.3 env like

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
  -e network_env.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/enable-tls.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/inject-trust-anchor.yaml \
  --ntp-server "0.fedora.pool.ntp.org"

I can confirm that my /etc/sysconfig/network-scripts/ifcfg-vlan20 on a compute node is like above, i.e. without the netmask.

IPV6ADDR=fd00:fd00:fd00:2000::12

However, looking at the os-net-config data at /etc/os-net-config/config.json, the netmask *is* specified [1]. I was initially trying to determine if there was a difference in the way the v6 address was specified in the 7.3 vs stable liberty patches but they seem the same. So I am trying to determine if something changed in os-net-config which made the way the ifcfg files are written change to now include the netmask (perhaps it was ignored before),

thanks, marios

[1]
[root@overcloud-compute-0 os-net-config]# cat /etc/os-net-config/config.json
{"network_config": [{"dns_servers": [], "name": "br-ex", "members": [{"type": "interface", "name": "nic1", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:2000::12/64"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "fd00:fd00:fd00:3000::11/64"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.4/24"}], "vlan_id": 50}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "ovs_bridge", "addresses": [{"ip_netmask": "192.0.2.8/24"}]}]}

>
> ## Upgraded deployment
> [root@overcloud-controller-0 heat-admin]# cat
> /etc/sysconfig/network-scripts/ifcfg-vlan200
> # This file is autogenerated by os-net-config
> DEVICE=vlan200
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> DEVICETYPE=ovs
> TYPE=OVSIntPort
> OVS_BRIDGE=br-infra
> OVS_OPTIONS="tag=200"
> IPV6INIT=yes
> IPV6_AUTOCONF=no
> IPV6ADDR=fd00:fd00:fd00:2000::13/64
>
> Note that the upgraded deployment contains the subnet mask ( /64 ) in the
> IPV6ADDR.

thanks to gfidente... looks like this commit in os-net-config is changing the way the ifcfg files are created to include the netmask
https://github.com/openstack/os-net-config/commit/0b130b6b3b4a9e0768e99b1496d2852f2ca47bb7

I also confirmed on my compute node that /usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py looks like

data += "IPV6ADDR=%s\n" % first_v6.ip

(thanks jistr and gfidente) the workaround for now is to explicitly make the NetworkDeployment not happen at all during upgrade. We have a NetworkDeploymentActions parameter which gets mapped to the 'actions' property of the corresponding heat StructuredDeployment
http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::StructuredDeployment-prop-actions

We can try to set 'NetworkDeploymentActions: []' in the parameter_defaults section of the upgrades environment files:

major-upgrade-pacemaker-converge.yaml
major-upgrade-pacemaker.yaml
major-upgrade-pacemaker-init.yaml

I am not sure yet we can get away with '[]' because "Allowed values: CREATE, UPDATE, DELETE, SUSPEND, RESUME" in that heat doc ^^^ so we may need to explicitly set to something else like 'SUSPEND'. :/

I tried adding NetworkDeploymentActions: [] to the parameter_defaults of the major-upgrade-pacemaker* environments but at upgrade step 3 the network settings got reapplied and the resolv.conf went empty.

thanks for testing that, mcornea.

After more discussion with shardy and others on #tripleo we don't think it is heat after all that is triggering the network config to be re-applied.
It's looking like os-net-config gets updated and that triggers re-application of the config; it is the same config (see comment 5 for /etc/os-net-config/config.json ) but now os-net-config includes the netmask when writing the ifcfg files, as pointed out in comment 6.

I did some investigation and I don't think NetworkDeploymentActions helps here, because it's working as designed:

- If you leave it at the default of ['CREATE'] the deployment will never be reapplied by heat, even if the input_values change.
- If the input_values are unchanged, we don't even attempt to update the NetworkDeployment on update (it'll remain at CREATE_COMPLETE)
- If any input_values change, it'll move to UPDATE_COMPLETE (arguably this is a bug), but we actually don't do anything, we exit before performing any update because UPDATE isn't in DEPLOY_ACTIONS:
https://github.com/openstack/heat/blob/master/heat/engine/resources/openstack/heat/software_deployment.py#L259

I tested this and can confirm this works as expected, however I think because os-net-config is applied directly via an o-r-c script (not a heat-config hook), it may get reapplied every time *any* change to the orc data happens, i.e. it's not properly under the control of the SoftwareDeployment:
https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-net-config/os-refresh-config/configure.d/20-os-net-config

This is one reason I'm trying to move away from group: os-apply-config, as all such config suffers from this issue:
https://review.openstack.org/#/c/271450/

That said, if the config hasn't changed, I don't think re-running os-net-config should do anything, and if it does it's probably a bug in os-net-config itself.
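
To make the suspected mechanism concrete, here is a small Python sketch of how an unchanged config.json can still lead to an interface restart once the renderer starts including the prefix length. This is an illustration under assumptions, not os-net-config's code: the stdlib `ipaddress` module stands in for whatever address library the package uses, and `render_ipv6addr` and `needs_restart` are hypothetical names.

```python
import ipaddress


def render_ipv6addr(ip_netmask, include_prefix):
    """Render the IPV6ADDR line from an "ip/prefix" string as found in config.json."""
    iface = ipaddress.ip_interface(ip_netmask)
    # Old behaviour in this thread: address only. New behaviour: address/prefix.
    value = iface.with_prefixlen if include_prefix else str(iface.ip)
    return "IPV6ADDR=%s\n" % value


def needs_restart(existing_content, new_content):
    """Hypothetical check: restart the interface only if the file would change."""
    return existing_content != new_content


# Value taken from the config.json pasted in this bug.
ADDRESS = "fd00:fd00:fd00:2000::12/64"

on_disk = render_ipv6addr(ADDRESS, include_prefix=False)   # written before the upgrade
rewritten = render_ipv6addr(ADDRESS, include_prefix=True)  # written after the upgrade

print(needs_restart(on_disk, on_disk))    # same renderer, same config
print(needs_restart(on_disk, rewritten))  # same config, different renderer
```

The input never changes in this sketch; only the rendering does, which is enough for a content comparison to report a difference.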
The change at https://review.openstack.org/302352 will prevent ifup/ifdown scripts from emptying the resolv.conf when restarting interfaces which don't have DNS1, DNS2.

We might still suffer issues caused by unwanted interface restarts; should that be a problem, an alternative approach is at https://review.openstack.org/#/c/302337/

Created attachment 1144694
updated os-net-config rpm with the change from https://review.openstack.org/#/c/302352/4

I patched the os-net-config rpm with the change at https://review.openstack.org/#/c/302352/4 (attached). By itself, the change won't fix the issue we are seeing here. Setting PEERDNS=no will help for future changes to the ifcfg-vlanXX files. However, for the upgrade we need to delete the /etc/resolv.conf.save file before updating the os-net-config package so that it simply *cannot* be restored to /etc/resolv.conf (and so not overwritten). Am trying this as workaround for now - we could add it to the UpgradeInitCommand...

I tested the fix at https://review.openstack.org/#/c/302769/3 (copy-pasting my comment from there; it would be great to have someone else verify too, please):

so FWIW I tested this on a 3/1 v6 environment deployed like

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
  -e network_env.yaml \
  --ntp-server "0.fedora.pool.ntp.org"

On all nodes bar controller-2 I manually installed the updated version of os-net-config that includes gfidente's fix from https://review.openstack.org/#/c/302352/4 (that rpm is attached to the bugzilla, follow the gerrit bug). So on controller-2 we only removed the /etc/resolv.conf.save file.
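
The removal step itself is tiny; a hedged Python sketch of it is below. `drop_saved_resolv_conf` is a hypothetical helper name, and the demonstration runs against a throwaway temporary file rather than the real /etc/resolv.conf.save.

```python
import os
import tempfile


def drop_saved_resolv_conf(path="/etc/resolv.conf.save"):
    """Delete the saved copy if present; return True when a file was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Nothing to restore from, which is the state we want anyway.
        return False


# Demonstrate against a temporary file, not the real /etc path.
with tempfile.TemporaryDirectory() as tmp:
    saved = os.path.join(tmp, "resolv.conf.save")
    with open(saved, "w") as handle:
        handle.write("# empty saved copy\n")
    first = drop_saved_resolv_conf(saved)   # file exists, gets removed
    second = drop_saved_resolv_conf(saved)  # already gone, no error raised
    print(first, second)
```

Treating a missing file as success keeps the step safe to run more than once.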
I completed the init successfully, with this change applied like

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans-v6.yaml \
  -e network_env.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
  -e rhos-release-8.yaml

Once completed I then upgraded controllers (step 3 is where this was reported yesterday) and it finished OK. The nodes have retained their resolv.conf fine:

[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i 'hostname; sudo grep nameserver /etc/resolv.conf'; done
overcloud-controller-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-1.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-controller-2.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
overcloud-compute-0.localdomain
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1

info for anyone looking to test the removal of the /etc/resolv.conf.save file - since the change at https://review.openstack.org/#/c/302769/ is not yet in stable/liberty you can include the change in your environment before starting the upgrade:

sudo su
pushd /usr/share/openstack-tripleo-heat-templates
# replace with the file from the review:
curl "https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=blob_plain;f=extraconfig/tasks/major_upgrade_pacemaker_init.yaml;hb=706c2fe4b62f95ac13ee800fc08e549180afc810" > extraconfig/tasks/major_upgrade_pacemaker_init.yaml
# sanity check
cat extraconfig/tasks/major_upgrade_pacemaker_init.yaml
popd
exit

(In reply to marios from comment #16)
> info for anyone looking to test the removal of the /etc/resolv.conf.save
> file - since the change at https://review.openstack.org/#/c/302769/ is not
> yet in stable/liberty you can include the change in your environment before
> starting the upgrade:
>

This change is in the latest tht build. It was manually backported once it landed on master.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0637.html
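
For anyone verifying the advisory on their own nodes: a plain `grep nameserver /etc/resolv.conf` also matches the "# No nameservers found" comment line, so a non-empty grep result does not by itself prove a nameserver is configured. A minimal Python sketch of a stricter check follows; `nameservers` is a hypothetical helper, and the sample text reuses only lines that appear in this report.

```python
def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver <addr>' lines, ignoring comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # A real entry has "nameserver" as its first field; comment lines start with "#".
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Content matching the grep output pasted in this report.
HEALTHY = """\
# No nameservers found; try putting DNS servers into your
nameserver 192.168.122.1
"""

# The failure described in the summary: an empty file after the upgrade.
EMPTY = ""

print(nameservers(HEALTHY))
print(nameservers(EMPTY))
```

An empty list from this function corresponds to the broken state in the bug summary.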