Description of problem:
-----------------------
Cannot ssh to a VM after/during the major upgrade converge step, even though the instances are reported ACTIVE.

openstack server list -f yaml

- Flavor: v1-1G-5G
  ID: d19b8c8c-54cc-40f2-9ae7-d847bc68fe6d
  Image: upgrade_workload
  Name: instance_6e00778d92
  Networks: internal_net=192.168.0.21, 10.0.0.217
  Status: ACTIVE
- Flavor: v1-1G-5G
  ID: f781803e-81c6-472d-8fed-f8887da08922
  Image: upgrade_workload
  Name: instance_5c39032710
  Networks: internal_net=192.168.0.15, 10.0.0.215
  Status: ACTIVE

ssh cirros@10.0.0.217
ssh: connect to host 10.0.0.217 port 22: No route to host

ssh cirros@10.0.0.215
ssh: connect to host 10.0.0.215 port 22: No route to host

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-neutron-openvswitch-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
puppet-neutron-12.4.1-0.20180412211913.el7ost.noarch
python2-neutron-lib-1.13.0-1.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
python2-neutronclient-6.7.0-1.el7ost.noarch
python-neutron-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011361.0ec54fd.el7ost.noarch

Steps to Reproduce:
-------------------
1. Install RHOS-12 with pre-provisioned servers (split-stack)
2. Upgrade the undercloud to RHOS-13
3. Launch a VM and associate a floating IP with it; make sure it is reachable
4. Upgrade the overcloud to RHOS-13
5. Try to reach the VM via its floating IP

Actual results:
---------------
VM is not reachable

Expected results:
-----------------
VM is reachable

Additional info:
----------------
Virtual split-stack environment: 3 controllers + 3 messaging + 3 database + 3 ceph + 2 networkers + 2 computes
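The failed ssh attempts above can be reproduced with a faster probe. This is only a sketch: the FIP value is taken from this report, and it checks TCP reachability of port 22 rather than an actual guest login.

```shell
# Probe TCP/22 on the floating IP with a short timeout instead of a bare
# ssh attempt, so an unreachable VM fails fast with a clear message.
FIP=10.0.0.217
if timeout 5 bash -c "exec 3<>/dev/tcp/$FIP/22" 2>/dev/null; then
    echo "port 22 reachable on $FIP"
else
    echo "port 22 NOT reachable on $FIP"
fi
```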
Hi, is it the same cause as bug 1589684?
@nlevinki: I don't think it's the same issue. In the sos reports attached there I don't see br-ex going down and coming up again, which is what caused that issue. The problem here is that during the upgrade process the br-ex bridge was "restarted":

Jun 12 17:42:47 networker-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-ex
Jun 12 17:42:47 networker-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-ex -- set bridge br-ex other-config:hwaddr=52:54:00:b9:fc:e0 -- set bridge br-ex fail_mode=standalone -- del-controller br-ex

This was triggered by the os-net-config script, which (probably) made some changes in one of the files /etc/sysconfig/network-scripts/{ifcfg-br-ex,route-br-ex,route6-br-ex}.

After the bridge was created again, it didn't have the proper OpenFlow rules, which should be created by neutron-openvswitch-agent, and because of that there is no connectivity to the qrouter-XXX namespace.

As a workaround you may restart the neutron_ovs_agent container and it will reconfigure the flows on this bridge.

There is already a patch merged to the upstream Queens branch which adds monitoring of such external bridges, so the OVS agent should reconfigure such a bridge automatically without any restart. The BZ for that is https://bugzilla.redhat.com/show_bug.cgi?id=1576286 and the upstream patch is https://review.openstack.org/#/c/567145/
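On an affected node, the workaround above would look roughly like this. This is a sketch for a containerized OSP-13 networker/compute node run as root; the container name is the one mentioned in this report, and the "single NORMAL flow" observation is a general property of a freshly created standalone-mode bridge, not something taken from this bug's logs.

```shell
# A recreated br-ex in standalone fail mode typically carries only the
# default NORMAL flow until the agent reprograms it:
ovs-ofctl dump-flows br-ex

# Restart the OVS agent container so it reconfigures the flows on br-ex:
docker restart neutron_ovs_agent

# After the agent resyncs, the bridge should show its full flow table again:
ovs-ofctl dump-flows br-ex
```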
I've marked https://bugzilla.redhat.com/show_bug.cgi?id=1576286 as a blocker; we'll merge the fix right now.
@dalvarez: this moved to POST, but there is no tracker. Can you confirm that https://review.openstack.org/#/c/567145/ should fix this? If so, can we add it as a tracker?
The tracker is in https://bugzilla.redhat.com/show_bug.cgi?id=1576286 (which has the blocker+ flag); we should probably mark this one as depending on bug 1576286, or maybe as a duplicate?
@Bernard: I wouldn't mark it as a duplicate, as those are in fact different issues where one is the result of the other. So IMO "depends on" would be better here.
Agreed, "depends on" is better. This is *not* a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1576286, because in this RHBZ we're seeing two issues:

1) The Director upgrade, via os-net-config, is rewriting the ifcfg files and restarting the devices, which also deletes and recreates br-ex
2) If (1) happens, the OVS agent doesn't reprogram the flows on br-ex

This RHBZ should track (1), while https://bugzilla.redhat.com/show_bug.cgi?id=1576286 tracks (2).
In light of comment 8 I'm moving this to HardProv DFG.
I'd like to get some info on the configuration prior to the upgrade. For example, were the old-style NIC config files being used, so that you needed to change to the new-style configs (which is required in OSP-13)? Can you provide the NIC configs and network environment files from before and after the upgrade (if different, otherwise just from before)? Also, what deployment command was run on upgrade (i.e. what files were included), and has it changed from the initial deployment?
We believe that all interfaces and bridges are getting restarted on upgrade because the order of parameters in the ifcfg files has changed slightly in Queens due to this change: https://review.openstack.org/#/c/485132/9/os_net_config/impl_ifcfg.py

Here is an OSP-12 ifcfg file:

[root@networker-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-br-ex
# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
<snip>

Here is an OSP-13 ifcfg file:

# This file is autogenerated by os-net-config
DEVICE=br-ex
HOTPLUG=no
ONBOOT=yes    <=== different location
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSBridge
<snip>

os-net-config does a file diff between the existing ifcfg file and what it intends to write, and it treats this reordering as a change requiring a restart of the devices. There is a patch upstream: https://review.openstack.org/#/c/575220/
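The effect of that reordering can be illustrated with a plain file diff. This is a sketch using hypothetical /tmp file names and only three of the keys: a line-by-line comparison flags the reordered file as changed even though the key/value pairs are identical.

```shell
# Two ifcfg fragments with the same key/value pairs in a different order,
# mimicking the OSP-12 vs OSP-13 output of os-net-config (file names here
# are illustrative, not the paths os-net-config actually writes).
printf 'DEVICE=br-ex\nONBOOT=yes\nHOTPLUG=no\n' > /tmp/ifcfg-osp12
printf 'DEVICE=br-ex\nHOTPLUG=no\nONBOOT=yes\n' > /tmp/ifcfg-osp13

# A plain line-by-line diff reports a change, which is what makes
# os-net-config decide the device needs a restart:
if ! diff -q /tmp/ifcfg-osp12 /tmp/ifcfg-osp13 > /dev/null; then
    echo "files differ: device would be restarted"
fi

# An order-insensitive comparison shows the configuration is the same:
if diff <(sort /tmp/ifcfg-osp12) <(sort /tmp/ifcfg-osp13) > /dev/null; then
    echo "same key/value pairs: restart not actually needed"
fi
```

The sorted diff is shown only to demonstrate that the files differ in order, not in content; it is not necessarily how the upstream os-net-config patch implements its fix.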
FWIW, in the upgrade tasks there is a workaround that prevents os-net-config from triggering the ifcfg restarts (running os-net-config with the --no-activate option):
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L84-L90

But in the case of pre-deployed servers, os-net-config gets updated before the upgrade tasks by:
https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/deployed-server/deployed-server-bootstrap-rhel.sh#L5-L12

Hence the following workaround condition fails (os-net-config is already updated by the time the upgrade tasks run):
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L93
https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/tripleo-packages.yaml#L74-L77
Verified with os-net-config-8.4.1-4.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086