Bug 1559151
Summary: | OSP11 -> OSP12 upgrade: after rebooting controller nodes post upgrade at boot time interfaces set under ovs bridges have no network connectivity | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | openstack-tripleo-heat-templates | Assignee: | Marios Andreou <mandreou> |
Status: | CLOSED ERRATA | QA Contact: | Yurii Prokulevych <yprokule> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 12.0 (Pike) | CC: | bfournie, dbecker, jamsmith, jlibosva, jschluet, mandreou, mburns, mcornea, morazi, rhel-osp-director-maint, tvignaud |
Target Milestone: | z3 | Keywords: | Regression, Triaged, ZStream |
Target Release: | 12.0 (Pike) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-heat-templates-7.0.9-14.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Connectivity problems that occurred after OSP11-to-OSP12 upgrades have been resolved by the removal of an obsolete network configuration file.
The file was /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json. Its presence on post-upgrade systems caused connectivity problems after a reboot on any overcloud node. Interfaces set under OVS bridges had no connectivity. For example, controller nodes were unable to rejoin the pacemaker cluster.
The upgrade process now removes the file and prevents the connectivity problems.
|
Last Closed: | 2018-08-20 12:59:48 UTC | Type: | Bug |
Description
Marius Cornea
2018-03-21 19:48:57 UTC
The only thing that quickly comes to mind, given 'openvswitch' and 'os-net-config', is this https://review.openstack.org/#/c/510577/1/puppet/services/tripleo-packages.yaml which sets --no-activate on the os-net-config run during the upgrade for BZ 1491628. I am not sure it is related yet though, or what happens after reboot with os-net-config.

The br-ex bridge is missing the flows again so it doesn't switch packets. The symptom is the same as in bug 1473763: the br-ex ifcfg file doesn't contain the fix provided by that bug: https://review.openstack.org/#/c/496707/

$ cat etc/sysconfig/network-scripts/ifcfg-br-ex
# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=10.0.0.101
NETMASK=255.255.255.0
OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3 -- set bridge br-ex fail_mode=standalone"

^^ see that " -- del-controller br-ex" is missing in the ifcfg-br-ex file

So the question is whether os-net-config was updated before the upgrade as stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18

(In reply to Jakub Libosvar from comment #5)
> So the question is whether os-net-config was updated before the upgrade as
> stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18

According to the logs, it looks like os-net-config didn't get updated before the upgrade, as the 'test -s /etc/os-net-config/config.json' command failed:

Mar 20 22:14:16 localhost os-collect-config: TASK [Check that os-net-config has configuration] ******************************
Mar 20 22:14:16 localhost os-collect-config: fatal: [localhost]: FAILED! => {"changed": true, "cmd": "test -s /etc/os-net-config/config.json", "delta": "0:00:00.004004", "end": "2018-03-21 02:09:41.930566", "msg": "non-zero return code", "rc": 1, "start": "2018-03-21 02:09:41.926562", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Mar 20 22:14:16 localhost os-collect-config: ...ignoring
Mar 20 22:14:16 localhost os-collect-config: TASK [Upgrade os-net-config] ***************************************************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [take new os-net-config parameters into account now] **********************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [Update all packages] *****************************************************
Mar 20 22:14:16 localhost os-collect-config: changed: [localhost]
Mar 20 22:14:16 localhost os-collect-config: PLAY RECAP *********************************************************************

After some investigation, /etc/os-net-config/config.json is empty as a result of the OSP11 deployment (issue described in bug 1514949). The upgrade relies on the config.json file being populated; otherwise os-net-config doesn't get updated before the rest of the packages: https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/puppet/services/tripleo-packages.yaml#L61

Is there any chance we can implement a workaround automatically in the upgrade procedure to cover this use case and:

1/ populate /etc/os-net-config/config.json before starting upgrade
2/ remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8

This would make the upgrade experience seamless for operators who are in this position.
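The two conditions discussed above (an empty /etc/os-net-config/config.json and a leftover template under /usr/libexec) can be checked on a node before starting the upgrade. The following is only a minimal sketch of such a check, not the logic used by the upgrade playbooks; the helper names `is_populated` and `precheck` and the `root` parameter are hypothetical and exist only for illustration.

```python
import os

# Both paths are taken from this report.
CONFIG_JSON = "/etc/os-net-config/config.json"
LEGACY_TEMPLATE = (
    "/usr/libexec/os-apply-config/templates/etc/os-net-config/config.json"
)


def is_populated(path):
    """Mimic `test -s path`: true if the file exists and is not empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:
        # Missing or unreadable file counts as not populated.
        return False


def precheck(root="/"):
    """Return a list of findings for the filesystem rooted at `root`.

    `root` is a hypothetical convenience so the check can be pointed at
    an unpacked copy of a node's filesystem instead of the live system.
    """
    def under(path):
        return os.path.join(root, path.lstrip("/"))

    findings = []
    if not is_populated(under(CONFIG_JSON)):
        findings.append("config.json is missing or empty")
    if os.path.exists(under(LEGACY_TEMPLATE)):
        findings.append("legacy os-apply-config template is present")
    return findings


if __name__ == "__main__":
    for finding in precheck():
        print(finding)
```

An empty result means neither condition was detected; it does not by itself prove that the node will keep connectivity after a reboot.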
(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
>
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from
> overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
>
> This would make the upgrade experience seamless for operators who are in
> this position.

Should the update process on 11 populate the file?

This fix - https://review.openstack.org/#/c/555452/ should resolve the issue with the empty config.json when backported to queens.

(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
>
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from
> overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
>
> This would make the upgrade experience seamless for operators who are in
> this position.

(adding a note here as briefly discussed with morazi/mcornea on irc):

2/ is easy enough and here [0] is a candidate for a 'general' place we could add a validation to. We could either add something to validate if that is present (/libexec config.json) or even just delete it unconditionally.

1/ is a bit harder, because the population of the config.json comes from the heat stack. I am still not clear why we didn't have the config.json populated in this particular case. Since this was an upgrade from 11..12, the pike templates e.g. at [1] are already defining $network_config and passing it into an invocation of os-net-config. So that should have already happened during the upgrade init heat stack update, or it wasn't happening then either, because we still had the /libexec around? In which case, perhaps we need to add the removal of that element os-net-config from the UpgradeInit, I mean [2][3].

[0] https://github.com/openstack/tripleo-heat-templates/blob/0299096401e87a1f1996f981b8387f51640dc22e/puppet/services/tripleo-packages.yaml#L64
[1] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/network/config/single-nic-vlans/controller.yaml#L87-L89
[2] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps.yaml#L7
[3] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps-docker.yaml#L7

> 1/ is a bit harder, because the population of the config.json comes from the heat
> stack. I am still not clear why we didn't have the config.json populated in this
> particular case. Since this was upgrade from 11..12 .. the pike templates e.g. at [1]
> are already defining $network_config and passing it into an invocation of
> os-net-config. SO that should have already happened during the upgrade init heat stack
> update, or, it wasn't happening then either, because we still had the /libexec around?

Marios - I'm still unclear on all of the mechanics related to config.json, particularly on upgrade, but one thing to note is that config.json IS populated properly by os-net-config, but then it gets overwritten when os-apply-config invokes os-net-config using the empty file from /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json. This is the behavior described here https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c2 and https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c4 and also seen by Dan and Marius in this setup, correct me if I'm wrong.
Also /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json only exists on versions prior to OSP-12; in OSP-12 it was changed to element_config.json (https://review.openstack.org/#/c/470014/).

(In reply to Bob Fournier from comment #15)
> Also /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json
> only exists on versions prior to OSP-12, in OSP-12 it was changed to
> element_config.json (https://review.openstack.org/#/c/470014/).

ack, thanks for the clarifications Bob. I'll post something momentarily to remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json at the start of the upgrade (as per the discussion here and at https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c5, it sounds like that should solve the problem).

Thanks Marios. After talking this over with Dan Prince, this should also work - https://review.openstack.org/#/c/556539/. This will remove the files used by os-apply-config (old style nic config file) when run-os-net-config.sh is invoked (new style nic config) to avoid the overwrite problem on upgrade.

(In reply to Bob Fournier from comment #17)
> Thanks Marios. After talking this over with Dan Prince, this should also
> work - https://review.openstack.org/#/c/556539/. This will remove the files
> used by os-apply-config (old style nic config file) when
> run-os-net-config.sh is invoked (new style nic config) to avoid the
> overwrite problem on upgrade.

ack thanks, let's add that to trackers. I also just posted https://review.openstack.org/556533 which removes the libexec config with the upgrade init command (we already have the mechanism for it, this just adds the command there; I don't see the harm in doing both. Let's update trackers?)

Information for build openstack-tripleo-heat-templates-7.0.9-14.el7ost
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=679218

Verified. OC nodes managed to form the cluster after upgrade and reboot.
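The symptom reported earlier in this bug was an ifcfg-br-ex file whose OVS_EXTRA lacked " -- del-controller br-ex". As a rough way to spot that symptom on a node, the sketch below parses the text of an ifcfg file and reports whether OVS_EXTRA contains that command. The function names `parse_ifcfg` and `has_del_controller` are hypothetical, the parsing is deliberately simple (one KEY=value per line, shell-style quoting), and this is not how os-net-config itself reads or writes these files.

```python
import shlex


def parse_ifcfg(text):
    """Parse KEY=value lines of an ifcfg file into a dict, skipping comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex removes the surrounding shell quotes, if any.
        parts = shlex.split(raw)
        values[key] = parts[0] if parts else ""
    return values


def has_del_controller(text, bridge="br-ex"):
    """Return True if OVS_EXTRA includes 'del-controller <bridge>'."""
    extra = parse_ifcfg(text).get("OVS_EXTRA", "")
    # OVS_EXTRA chains ovs-vsctl commands separated by "--".
    commands = [command.strip() for command in extra.split("--")]
    return "del-controller {}".format(bridge) in commands


# OVS_EXTRA value copied from the ifcfg-br-ex file quoted in this report.
SAMPLE = (
    "# This file is autogenerated by os-net-config\n"
    "DEVICE=br-ex\n"
    'OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3'
    ' -- set bridge br-ex fail_mode=standalone"\n'
)

if __name__ == "__main__":
    print(has_del_controller(SAMPLE))
```

For the sample taken from this report the check returns False, which matches the missing command pointed out in the original comment.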
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2331