Description of problem:
-----------------------
Cannot ssh to a VM after/during the major upgrade converge step, even though the instances are reported ACTIVE.

openstack server list -f yaml

- Flavor: v1-1G-5G
  ID: d19b8c8c-54cc-40f2-9ae7-d847bc68fe6d
  Image: upgrade_workload
  Name: instance_6e00778d92
  Networks: internal_net=192.168.0.21, 10.0.0.217
  Status: ACTIVE
- Flavor: v1-1G-5G
  ID: f781803e-81c6-472d-8fed-f8887da08922
  Image: upgrade_workload
  Name: instance_5c39032710
  Networks: internal_net=192.168.0.15, 10.0.0.215
  Status: ACTIVE

ssh cirros@10.0.0.217
ssh: connect to host 10.0.0.217 port 22: No route to host

ssh cirros@10.0.0.215
ssh: connect to host 10.0.0.215 port 22: No route to host

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-neutron-openvswitch-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
puppet-neutron-12.4.1-0.20180412211913.el7ost.noarch
python2-neutron-lib-1.13.0-1.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
python2-neutronclient-6.7.0-1.el7ost.noarch
python-neutron-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch

Steps to Reproduce:
-------------------
1. Install RHOS-12 with pre-provisioned servers (split-stack)
2. Upgrade the undercloud to RHOS-13
3. Launch a VM and associate a floating IP with it; make sure it is reachable
4. Upgrade the overcloud to RHOS-13
5. Try to reach the VM via its floating IP

Actual results:
---------------
VM is not reachable

Expected results:
-----------------
VM is reachable

Additional info:
----------------
Virtual split-stack environment: 3 controllers + 3 messaging + 3 database + 3 ceph + 2 networkers + 2 computes
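The failed ssh attempts above can be reproduced with a faster probe. This is only a sketch: the FIP value is taken from this report, and it checks TCP reachability of port 22 rather than an actual guest login.

```shell
# Probe TCP/22 on the floating IP with a short timeout instead of a bare
# ssh attempt, so an unreachable VM fails fast with a clear message.
FIP=10.0.0.217
if timeout 5 bash -c "exec 3<>/dev/tcp/$FIP/22" 2>/dev/null; then
    echo "port 22 reachable on $FIP"
else
    echo "port 22 NOT reachable on $FIP"
fi
```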
Hi, is it the same cause as bug 1589684?
@nlevinki: I don't think it's the same issue. In the sos reports attached there I don't see br-ex going down and coming up again, which is what caused that issue. The problem here is that during the upgrade process the br-ex bridge was "restarted":

Jun 12 17:42:47 networker-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-ex
Jun 12 17:42:47 networker-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-ex -- set bridge br-ex other-config:hwaddr=52:54:00:b9:fc:e0 -- set bridge br-ex fail_mode=standalone -- del-controller br-ex

This was triggered by the os-net-config script, which (probably) made some changes in one of the files /etc/sysconfig/network-scripts/{ifcfg-br-ex,route-br-ex,route6-br-ex}.

After the bridge was created again, it didn't have the proper OpenFlow rules, which should be created by neutron-openvswitch-agent, and because of that there is no connectivity to the qrouter-XXX namespace.

As a workaround you may restart the neutron_ovs_agent container and it will reconfigure the flows on this bridge.

There is already a patch merged to the upstream Queens branch which adds monitoring of such external bridges, so the OVS agent should reconfigure such a bridge automatically without any restart. The BZ for that is https://bugzilla.redhat.com/show_bug.cgi?id=1576286 and the upstream patch is https://review.openstack.org/#/c/567145/
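On an affected node, the workaround above would look roughly like this. This is a sketch for a containerized OSP-13 networker/compute node run as root; the container name is the one mentioned in this report, and the "single NORMAL flow" observation is a general property of a freshly created standalone-mode bridge, not something taken from this bug's logs.

```shell
# A recreated br-ex in standalone fail mode typically carries only the
# default NORMAL flow until the agent reprograms it:
ovs-ofctl dump-flows br-ex

# Restart the OVS agent container so it reconfigures the flows on br-ex:
docker restart neutron_ovs_agent

# After the agent resyncs, the bridge should show its full flow table again:
ovs-ofctl dump-flows br-ex
```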
I've marked https://bugzilla.redhat.com/show_bug.cgi?id=1576286 as a blocker; we'll merge the fix right now.
@dalvarez: this moved to POST, but there is no tracker. Can you confirm that https://review.openstack.org/#/c/567145/ should fix this? If so, can we add it as a tracker?
The tracker is in https://bugzilla.redhat.com/show_bug.cgi?id=1576286 (which has the blocker+ flag); we should probably mark this one as depending on bug 1576286, or maybe as a duplicate?
@Bernard: I wouldn't mark it as a duplicate, as those are in fact different issues where one is the result of the other. So IMO "depends on" would be better here.
Agreed, "depends on" is better. This is *not* a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1576286, because in this RHBZ we're seeing two issues:

1) The Director upgrade, via os-net-config, is rewriting the ifcfg files and restarting the devices, which also deletes and recreates br-ex
2) If (1) happens, the OVS agent doesn't reprogram the flows on br-ex

This RHBZ should track (1), while https://bugzilla.redhat.com/show_bug.cgi?id=1576286 tracks (2).
In light of comment 8 I'm moving this to HardProv DFG.
I'd like to get some info on the configuration prior to the upgrade. For example, were the old-style NIC config files being used, so that you needed to change to the new-style configs (which is required in OSP-13)? Can you provide the NIC configs and network environment files from before and after the upgrade (if different, otherwise just from before)? Also, what deployment command was run on upgrade (i.e. what files were included), and has it changed from the initial deployment?
We believe that all interfaces and bridges are getting restarted on upgrade because the order of parameters in the ifcfg files has changed slightly in Queens due to this change: https://review.openstack.org/#/c/485132/9/os_net_config/impl_ifcfg.py

Here is an OSP-12 ifcfg file:

[root@networker-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-br-ex
# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
<snip>

Here is an OSP-13 ifcfg file:

# This file is autogenerated by os-net-config
DEVICE=br-ex
HOTPLUG=no
ONBOOT=yes    <=== different location
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
<snip>

os-net-config does a file diff between the existing ifcfg file and what it intends to write, and it treats this reordering as a change requiring a restart of the devices. There is a patch upstream: https://review.openstack.org/#/c/575220/
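The effect of that reordering can be illustrated with a plain file diff. This is a sketch using hypothetical /tmp file names and only three of the keys: a line-by-line comparison flags the reordered file as changed even though the key/value pairs are identical.

```shell
# Two ifcfg fragments with the same key/value pairs in a different order,
# mimicking the OSP-12 vs OSP-13 output of os-net-config (file names here
# are illustrative, not the paths os-net-config actually writes).
printf 'DEVICE=br-ex\nONBOOT=yes\nHOTPLUG=no\n' > /tmp/ifcfg-osp12
printf 'DEVICE=br-ex\nHOTPLUG=no\nONBOOT=yes\n' > /tmp/ifcfg-osp13

# A plain line-by-line diff reports a change, which is what makes
# os-net-config decide the device needs a restart:
if ! diff -q /tmp/ifcfg-osp12 /tmp/ifcfg-osp13 > /dev/null; then
    echo "files differ: device would be restarted"
fi

# An order-insensitive comparison shows the configuration is the same:
if diff <(sort /tmp/ifcfg-osp12) <(sort /tmp/ifcfg-osp13) > /dev/null; then
    echo "same key/value pairs: restart not actually needed"
fi
```

The sorted diff is shown only to demonstrate that the files differ in order, not in content; it is not necessarily how the upstream os-net-config patch implements its fix.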
FWIW, in the upgrade tasks there is a workaround that prevents os-net-config from triggering the ifcfg restarts (running os-net-config with the --no-activate option):
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L84-L90

But in the case of pre-deployed servers, os-net-config gets updated before the upgrade tasks by:
https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/deployed-server/deployed-server-bootstrap-rhel.sh#L5-L12

Hence the following workaround condition fails (os-net-config is already updated by the time the upgrade tasks run):
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L93
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L74-L77
Verified with os-net-config-8.4.1-4.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086