Bug 1968445
| Summary: | [OSP16.2]nova-compute service is down after openvswitch_restart | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> |
| Component: | openvswitch | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 16.2 (Train) | CC: | apevec, bfournie, ccamposr, chrisw, dalvarez, echaudro, egallen, fhallal, hbrock, i.maximets, jlibosva, jslagle, mburns, rsafrono, spower |
| Target Milestone: | beta | Keywords: | AutomationBlocker, Regression, Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhosp-openvswitch-2.15-4.el8ost.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1970832 (view as bug list) | Environment: | |
| Last Closed: | 2021-09-15 07:15:41 UTC | Type: | Bug |
| Bug Depends On: | 1970832 | | |
| Bug Blocks: | | | |
Was there any troubleshooting done to figure out what is causing this? Is openvswitch programmed correctly after the restart? As I do not have an OSP system, can you make yours available?

Checked Eran's setup, and it looks like after the OVS restart the undercloud network is not set up correctly, as OVN can't talk to the SB controller.
[root@compute-1 heat-admin]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected
[root@compute-1 heat-admin]# ovs-vsctl list open
_uuid : c659a74f-9893-4260-a873-fe0b5ae1d88d
bridges : [8863c7c3-ed20-4ef6-ae37-d97fefca4457, af1f0af3-8f54-4b7f-af73-75e8880b7b6d, ca80eca0-8ecb-43d0-80a4-f04fac6acad1]
cur_cfg : 533
datapath_types : [netdev, system]
datapaths : {}
db_version : "8.2.0"
dpdk_initialized : false
dpdk_version : "DPDK 20.11.0"
external_ids : {hostname=compute-1.redhat.local, ovn-bridge=br-int, ovn-bridge-mappings="datacentre:br-ex,tenant:br-isolated", ovn-encap-ip="172.17.2.15", ovn-encap-type=geneve, ovn-openflow-probe-interval="60", ovn-remote="tcp:172.17.1.122:6642", ovn-remote-probe-interval="60000", rundir="/var/run/openvswitch", system-id="277508c2-98bd-4c54-9576-aac113bda829"}
iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options : [016abb0c-b9dd-43ac-84d8-fef8b5a489b8]
next_cfg : 533
other_config : {}
ovs_version : "2.15.1"
ssl : []
statistics : {}
system_type : rhel
system_version : "8.4"
[root@compute-1 heat-admin]# ping 172.17.1.122
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
^C
--- 172.17.1.122 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3111ms
[root@compute-1 heat-admin]# ip r get 172.17.1.122
172.17.1.122 via 192.168.24.1 dev ens3 src 192.168.24.50 uid 0
cache
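The route lookup above resolves the SB address through the default gateway on ens3 rather than through a connected route on vlan20, which suggests the vlan20 interface has lost its 172.17.1.x address. A quick way to confirm this (a hedged sketch; the interface name and subnet are specific to this environment):

ip -br addr show vlan20         # when healthy, a 172.17.1.0/24 address should be listed here
ip route show 172.17.1.0/24     # the connected route disappears together with the address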
I think I know what's going on.
When the node is rebooted, things look good; vlan20 is in OVS (br-isolated) and has an IP configured on the SB database network:
Bridge br-isolated
fail_mode: standalone
Port vlan30
tag: 30
Interface vlan30
type: internal
Port vlan20
tag: 20
Interface vlan20
type: internal
[root@compute-1 ~]# ip a sh vlan20
9: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 2a:53:8f:15:e5:57 brd ff:ff:ff:ff:ff:ff
inet 172.17.1.109/24 brd 172.17.1.255 scope global vlan20
valid_lft forever preferred_lft forever
inet6 fe80::2853:8fff:fe15:e557/64 scope link
valid_lft forever preferred_lft forever
[root@compute-1 ~]# ovs-vsctl get open . external_ids:ovn-remote
"tcp:172.17.1.122:6642"
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=0.274 ms
--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.274/0.274/0.274/0.000 ms
Then when you restart OVS, the IP goes away:
[root@compute-1 ~]# sudo systemctl restart openvswitch.service
[root@compute-1 ~]# ip a sh vlan20
16: vlan20: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether fa:a1:d3:53:50:21 brd ff:ff:ff:ff:ff:ff
Adding the IP back to vlan20 makes it work:
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected
[root@compute-1 ~]# ip l s dev vlan20 up
[root@compute-1 ~]# ip a a 172.17.1.109/24 dev vlan20
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=1.25 ms
--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.248/1.248/1.248/0.000 ms
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
connected
My take is:
1) The control plane should be put outside OVS so that a restart of OVS doesn't cause reconnections to the OVN DBs, which are costly. We've been trying to move away from this configuration for a long time.
2) The IP configuration is lost upon restart of OVS; I don't think this is caused by the newer OVS version, but can you please confirm? I'd say it has been like this before and OVS is working OK.
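Regarding point 1, whether the control-plane VLAN really lives inside OVS can be seen from the bridge configuration itself; a minimal check, assuming the bridge and interface names used in this environment:

ovs-vsctl list-ports br-isolated      # vlan20 listed here means the SB database network rides on an OVS port
ovs-vsctl get Interface vlan20 type   # "internal" confirms it is an OVS-managed internal interface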
(In reply to Daniel Alvarez Sanchez from comment #7)
> My take is:
>
> 1) The control plane should be put outside OVS so that a restart of OVS
> doesn't cause reconnections to the OVN DBs, which are costly. We've been
> trying to move away from this configuration for a long time.
> 2) The IP configuration is lost upon restart of OVS; I don't think this is
> caused by the newer OVS version, but can you please confirm? I'd say it has
> been like this before and OVS is working OK.

Actually, as pointed out by Dumitru, it could be an issue in OVS 2.15:
https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html
Taking BZ to figure this out ;)

Assigning it to Ilya, who has root-caused the issue in ovsdb-idl and will post a fix upstream for it soon.

The issue is that ovs-vswitchd starts configuring bridges while it has not yet connected to, or not yet received all the data from, ovsdb-server. So ovs-vswitchd thinks that there should be no bridges and ports and deletes them. After receiving the actual data, ports and bridges are re-created, but IPs and other information are already lost at this point.

This is a regression from the ovsdb-idl split that happened in 2.15. Fix posted upstream for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

*** Bug 1967142 has been marked as a duplicate of this bug. ***

(In reply to Ilya Maximets from comment #15)
> The issue is that ovs-vswitchd starts configuring bridges while it has
> not yet connected to, or not yet received all the data from, ovsdb-server.
> So ovs-vswitchd thinks that there should be no bridges and ports and
> deletes them. After receiving the actual data, ports and bridges
> will be re-created, but IPs and other information are already lost at
> this point.
>
> This is a regression from the ovsdb-idl split that happened in 2.15.
>
> Fix posted upstream for review:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

Can we get a new RPM so we can test the proposed fix?

(In reply to Eran Kuris from comment #17)
> Can we get a new RPM so we can test the proposed fix?

Sure. Here it is:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1968445.0.1.el8fdp/

The issue was fixed according to the latest build run:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve/260/testReport/neutron_plugin.tests.scenario.test_multicast/MulticastTestIPv4Common/test_igmp_snooping_after_openvswitch_restart_id_d6730359_5d78_438c_ad70_5c8aadac6a1d_/

core_puddle: RHOS-16.2-RHEL-8-20210614.n.1
2021-06-14T11:53:47+0000 DEBUG Installed: openvswitch2.15-2.15.0-24.el8fdp.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483
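For anyone re-verifying the fix on an updated node, the check performed by the CI job can be approximated manually. This is only a sketch, assuming the compute-node layout shown earlier in this bug (vlan20 carrying the SB database network) and the fixed package version named above:

rpm -q openvswitch2.15                      # expect 2.15.0-24.el8fdp or newer
ip -br addr show vlan20                     # note the 172.17.1.x address
sudo systemctl restart openvswitch.service
ip -br addr show vlan20                     # with the fix, the address should survive the restart
podman exec ovn_controller ovn-appctl -t ovn-controller connection-status   # expect "connected"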
Description of problem:
After openvswitch2.15 is restarted, the nova-compute service is down; as a result, tests are failing with the error: "No valid host was found. There are not enough hosts available"

Version-Release number of selected component (if applicable):
core_puddle_version RHOS-16.2-RHEL-8-20210525.n.0
ovn-2021-21.03.0-21.el8fdp.x86_64
openvswitch2.15-2.15.0-22.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce (a condensed shell sketch follows below):
1. Install OSP 16.2 with the new OVS 2.15 and ovn-2021.
2. Run: openstack compute service list {all services are up}
3. Run: sudo systemctl restart openvswitch.service on a compute node.
4. Run: openstack compute service list {nova-compute service is down}

Actual results:

Expected results:

Additional info:
The workaround is to reboot the compute nodes.
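Steps 2-4 above can be condensed into a short shell sequence. The compute host name, the heat-admin user, and the 60-second settle time are assumptions based on this environment, not part of the original report:

# run from a host with OpenStack CLI credentials loaded
openstack compute service list                                        # all services up
ssh heat-admin@compute-1 'sudo systemctl restart openvswitch.service'
sleep 60                                                              # give nova-compute time to report its state
openstack compute service list                                        # with the affected OVS build, nova-compute shows as down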