Bug 1968445 - [OSP16.2]nova-compute service is down after openvswitch_restart
Summary: [OSP16.2]nova-compute service is down after openvswitch_restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Ilya Maximets
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1967142 (view as bug list)
Depends On: 1970832
Blocks:
 
Reported: 2021-06-07 12:25 UTC by Eran Kuris
Modified: 2021-09-15 07:16 UTC (History)
CC: 15 users

Fixed In Version: rhosp-openvswitch-2.15-4.el8ost.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1970832 (view as bug list)
Environment:
Last Closed: 2021-09-15 07:15:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:16:07 UTC

Description Eran Kuris 2021-06-07 12:25:33 UTC
Description of problem:
After openvswitch2.15 is restarted, the nova-compute service is down; as a result, tests are failing with the error: "No valid host was found. There are not enough hosts available"

Version-Release number of selected component (if applicable):
core_puddle_version 
RHOS-16.2-RHEL-8-20210525.n.0
ovn-2021-21.03.0-21.el8fdp.x86_64
openvswitch2.15-2.15.0-22.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install OSP 16.2 with the new OVS 2.15 and ovn-2021
2. Run: openstack compute service list (all services are up)
3. Run: sudo systemctl restart openvswitch.service on a compute node
4. Run: openstack compute service list (nova-compute service is down)
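
The check in steps 2 and 4 can be sketched as a small script. This is a hedged sketch: it assumes the `openstack` CLI is available on the node, and `nova_compute_state` is an illustrative helper name, written as a pure text filter so its logic can be exercised against canned output without a live cloud:

```shell
#!/bin/sh
# Report whether any nova-compute row in the service list is down.
# $1: full output of `openstack compute service list`.
nova_compute_state() {
    if printf '%s\n' "$1" | grep 'nova-compute' | grep -q 'down'; then
        echo down
    else
        echo up
    fi
}
```

On a live node this would be driven as `nova_compute_state "$(openstack compute service list)"`.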

Actual results:


Expected results:


Additional info:
Workaround: reboot the compute nodes.

Comment 2 Eelco Chaudron 2021-06-08 06:34:43 UTC
Was there any troubleshooting done to figure out what is causing this? Is openvswitch programmed correctly after the restart?

As I do not have an OSP system, can you make yours available?

Comment 6 Eelco Chaudron 2021-06-08 09:24:58 UTC
Checked Eran's setup, and it looks like after the OVS restart the undercloud network is not set up correctly: OVN can't talk to the SB database.


[root@compute-1 heat-admin]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected

[root@compute-1 heat-admin]# ovs-vsctl list open
_uuid               : c659a74f-9893-4260-a873-fe0b5ae1d88d
bridges             : [8863c7c3-ed20-4ef6-ae37-d97fefca4457, af1f0af3-8f54-4b7f-af73-75e8880b7b6d, ca80eca0-8ecb-43d0-80a4-f04fac6acad1]
cur_cfg             : 533
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 20.11.0"
external_ids        : {hostname=compute-1.redhat.local, ovn-bridge=br-int, ovn-bridge-mappings="datacentre:br-ex,tenant:br-isolated", ovn-encap-ip="172.17.2.15", ovn-encap-type=geneve, ovn-openflow-probe-interval="60", ovn-remote="tcp:172.17.1.122:6642", ovn-remote-probe-interval="60000", rundir="/var/run/openvswitch", system-id="277508c2-98bd-4c54-9576-aac113bda829"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : [016abb0c-b9dd-43ac-84d8-fef8b5a489b8]
next_cfg            : 533
other_config        : {}
ovs_version         : "2.15.1"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "8.4"


[root@compute-1 heat-admin]# ping 172.17.1.122
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
^C
--- 172.17.1.122 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3111ms


[root@compute-1 heat-admin]# ip r get 172.17.1.122
172.17.1.122 via 192.168.24.1 dev ens3 src 192.168.24.50 uid 0 
    cache

Comment 7 Daniel Alvarez Sanchez 2021-06-08 09:38:51 UTC
I think I know what's going on.

When the node is rebooted, things look good; vlan20 is in OVS (br-isolated) and has an IP configured on the SB database network:

    Bridge br-isolated
        fail_mode: standalone
        Port vlan30
            tag: 30
            Interface vlan30
                type: internal
        Port vlan20
            tag: 20
            Interface vlan20
                type: internal


[root@compute-1 ~]# ip a sh vlan20
9: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 2a:53:8f:15:e5:57 brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.109/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::2853:8fff:fe15:e557/64 scope link
       valid_lft forever preferred_lft forever


[root@compute-1 ~]# ovs-vsctl get open . external_ids:ovn-remote
"tcp:172.17.1.122:6642"


[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=0.274 ms

--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.274/0.274/0.274/0.000 ms



Then when you restart OVS, the IP goes away:


[root@compute-1 ~]# sudo systemctl restart openvswitch.service
[root@compute-1 ~]# ip a sh vlan20
16: vlan20: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fa:a1:d3:53:50:21 brd ff:ff:ff:ff:ff:ff



Adding the IP back to vlan20 makes it work:


[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected


[root@compute-1 ~]# ip l s dev vlan20 up
[root@compute-1 ~]# ip a a 172.17.1.109/24 dev vlan20
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=1.25 ms

--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.248/1.248/1.248/0.000 ms
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
connected



My take is:

1) The control plane should be put outside OVS so that an OVS restart doesn't cause reconnections to the OVN DBs, which are costly. We've been trying to move away from this configuration for a long time.
2) The IP configuration is lost upon restart of OVS; I don't think this is caused by the newer OVS version, but can you please confirm? I'd say it has been like this before and OVS is working OK.
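
The manual recovery shown above (bringing the port up and re-adding the address) can be wrapped in a small script. This is a sketch under assumptions: the interface name and address are the ones from this particular setup, `restore_ctlplane_ip` is an illustrative name, and `DRY_RUN=1` only prints the commands so the logic can be checked without touching a live node:

```shell
#!/bin/sh
# Re-add the control-plane IP that an OVS restart dropped from an
# internal port. With DRY_RUN=1 the commands are printed, not executed.
restore_ctlplane_ip() {
    iface=$1 addr=$2
    for cmd in "ip link set dev $iface up" "ip addr add $addr dev $iface"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}
```

For this node that would be `restore_ctlplane_ip vlan20 172.17.1.109/24`, matching the two `ip` commands in the transcript above.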

Comment 8 Daniel Alvarez Sanchez 2021-06-08 09:41:37 UTC
(In reply to Daniel Alvarez Sanchez from comment #7)

> My take is:
> 
> 1) The control plane should be put outside OVS so that an OVS restart
> doesn't cause reconnections to the OVN DBs, which are costly. We've been
> trying to move away from this configuration for a long time.
> 2) The IP configuration is lost upon restart of OVS; I don't think this is
> caused by the newer OVS version, but can you please confirm? I'd say it has
> been like this before and OVS is working OK.

Actually, as pointed out by Dumitru, it could be an issue in OVS 2.15:
https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html

Comment 9 Eelco Chaudron 2021-06-08 09:46:27 UTC
Taking BZ to figure this out ;)

Comment 13 Daniel Alvarez Sanchez 2021-06-08 12:36:02 UTC
Assigning it to Ilya, who has root-caused the issue in ovsdb-idl and will soon post a fix upstream.

Comment 15 Ilya Maximets 2021-06-08 13:25:05 UTC
The issue is that ovs-vswitchd starts configuring bridges while it's
not yet connected to ovsdb-server and has not yet received all the
data from it.  So, ovs-vswitchd thinks that there should be no bridges
and ports and deletes them.  After receiving the actual data, ports
and bridges are re-created, but IPs and other information are already
lost at this point.

This is a regression from the ovsdb-idl split that happened in 2.15.

Fix posted upstream for review:
  https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/
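
Given this root cause, one way to tell whether ovs-vswitchd has finished applying the database contents after a restart is to compare the `cur_cfg` and `next_cfg` fields shown in the `ovs-vsctl list open` output in comment 6 (ovs-vsctl bumps `next_cfg` and waits for `cur_cfg` to catch up). This is a sketch with illustrative function names, parsing canned output so it can be tested offline:

```shell
#!/bin/sh
# Extract a field from `ovs-vsctl list Open_vSwitch .` style output.
# $1: the command output, $2: field name (e.g. cur_cfg).
cfg_field() {
    printf '%s\n' "$1" | sed -n "s/^$2 *: *//p"
}

# Succeed only when vswitchd has applied the latest configuration,
# i.e. cur_cfg has caught up with next_cfg.
vswitchd_caught_up() {
    db=$1
    [ -n "$(cfg_field "$db" cur_cfg)" ] &&
    [ "$(cfg_field "$db" cur_cfg)" = "$(cfg_field "$db" next_cfg)" ]
}
```

On a live node this could be polled after `systemctl restart openvswitch.service` with `vswitchd_caught_up "$(ovs-vsctl list Open_vSwitch .)"`.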

Comment 16 Roman Safronov 2021-06-08 17:50:53 UTC
*** Bug 1967142 has been marked as a duplicate of this bug. ***

Comment 17 Eran Kuris 2021-06-09 07:40:39 UTC
(In reply to Ilya Maximets from comment #15)
> The issue is that ovs-vswitchd starts configuring bridges while it's
> not yet connected to ovsdb-server and has not yet received all the
> data from it.  So, ovs-vswitchd thinks that there should be no bridges
> and ports and deletes them.  After receiving the actual data, ports
> and bridges are re-created, but IPs and other information are already
> lost at this point.
> 
> This is a regression from the ovsdb-idl split that happened in 2.15.
> 
> Fix posted upstream for review:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

Can we get a new RPM so we can test the proposed fix?

Comment 18 Ilya Maximets 2021-06-09 08:16:36 UTC
(In reply to Eran Kuris from comment #17)
> Can we get a new RPM so we can test the proposed fix?

Sure.  Here it is:

http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1968445.0.1.el8fdp/

Comment 30 errata-xmlrpc 2021-09-15 07:15:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

