Description of problem:
After openvswitch2.15 is restarted, the nova-compute service goes down and, as a result, tests fail with the error: "No valid host was found. There are not enough hosts available"

Version-Release number of selected component (if applicable):
core_puddle_version RHOS-16.2-RHEL-8-20210525.n.0
ovn-2021-21.03.0-21.el8fdp.x86_64
openvswitch2.15-2.15.0-22.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce (consolidated as a command sequence after this comment):
1. Install OSP 16.2 with the new OVS 2.15 and ovn-2021.
2. Run: openstack compute service list (all services are up).
3. Run: sudo systemctl restart openvswitch.service on a compute node.
4. Run: openstack compute service list (the nova-compute service is down).

Actual results:
After the openvswitch service restart, the nova-compute service is down.

Expected results:
The nova-compute service stays up after the openvswitch service restart.

Additional info:
The workaround is to reboot the compute nodes.
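The reproduction steps above, collected as a command sequence (the prompts are illustrative; the compute node name matches the one used later in this report):

(overcloud) $ openstack compute service list
[heat-admin@compute-1 ~]$ sudo systemctl restart openvswitch.service
(overcloud) $ openstack compute service list

After the restart, the second listing shows nova-compute on that node as down.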
Was there any troubleshooting done to figure out what is causing this? Is openvswitch programmed correctly after the restart? As I do not have an OSP system, can you make yours available?
Checked Eran's setup, and it looks like after the OVS restart the undercloud network is not set up correctly, as OVN can't talk to the SB database.

[root@compute-1 heat-admin]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected

[root@compute-1 heat-admin]# ovs-vsctl list open
_uuid               : c659a74f-9893-4260-a873-fe0b5ae1d88d
bridges             : [8863c7c3-ed20-4ef6-ae37-d97fefca4457, af1f0af3-8f54-4b7f-af73-75e8880b7b6d, ca80eca0-8ecb-43d0-80a4-f04fac6acad1]
cur_cfg             : 533
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 20.11.0"
external_ids        : {hostname=compute-1.redhat.local, ovn-bridge=br-int, ovn-bridge-mappings="datacentre:br-ex,tenant:br-isolated", ovn-encap-ip="172.17.2.15", ovn-encap-type=geneve, ovn-openflow-probe-interval="60", ovn-remote="tcp:172.17.1.122:6642", ovn-remote-probe-interval="60000", rundir="/var/run/openvswitch", system-id="277508c2-98bd-4c54-9576-aac113bda829"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : [016abb0c-b9dd-43ac-84d8-fef8b5a489b8]
next_cfg            : 533
other_config        : {}
ovs_version         : "2.15.1"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "8.4"

[root@compute-1 heat-admin]# ping 172.17.1.122
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
^C
--- 172.17.1.122 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3111ms

[root@compute-1 heat-admin]# ip r get 172.17.1.122
172.17.1.122 via 192.168.24.1 dev ens3 src 192.168.24.50 uid 0
    cache
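For reference, the checks above can be re-run on any compute node to confirm whether ovn-controller can reach the SB database configured in ovn-remote (the 172.17.1.122 address is specific to this deployment):

[root@compute-1 heat-admin]# ovs-vsctl get open . external_ids:ovn-remote
[root@compute-1 heat-admin]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
[root@compute-1 heat-admin]# ping -c1 172.17.1.122
[root@compute-1 heat-admin]# ip r get 172.17.1.122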
I think I know what's going on. When the node is rebooted, things look good; vlan20 is in OVS (br-isolated) and has an IP configured on the SB database network:

    Bridge br-isolated
        fail_mode: standalone
        Port vlan30
            tag: 30
            Interface vlan30
                type: internal
        Port vlan20
            tag: 20
            Interface vlan20
                type: internal

[root@compute-1 ~]# ip a sh vlan20
9: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 2a:53:8f:15:e5:57 brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.109/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::2853:8fff:fe15:e557/64 scope link
       valid_lft forever preferred_lft forever

[root@compute-1 ~]# ovs-vsctl get open . external_ids:ovn-remote
"tcp:172.17.1.122:6642"

[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=0.274 ms

--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.274/0.274/0.274/0.000 ms

Then when you restart OVS, the IP goes away:

[root@compute-1 ~]# sudo systemctl restart openvswitch.service
[root@compute-1 ~]# ip a sh vlan20
16: vlan20: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fa:a1:d3:53:50:21 brd ff:ff:ff:ff:ff:ff

Adding the IP back to vlan20 makes it work:

[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected
[root@compute-1 ~]# ip l s dev vlan20 up
[root@compute-1 ~]# ip a a 172.17.1.109/24 dev vlan20
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=1.25 ms

--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.248/1.248/1.248/0.000 ms
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
connected

My take is:

1) The control plane should be put outside OVS so that a restart of OVS doesn't cause reconnections to the OVN DBs, which are costly. We've been trying to move away from this configuration for a long time.
2) The IP configuration is lost upon restart of OVS; I don't think this is caused by the newer OVS version, but can you please confirm? I'd say it has been like this before and OVS is working OK.
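For anyone hitting this before a fixed build is available, the temporary workaround from the output above is to bring the internal port back up, re-add its address, and confirm ovn-controller reconnects (the interface name and IP are specific to this deployment):

[root@compute-1 ~]# ip link set dev vlan20 up
[root@compute-1 ~]# ip addr add 172.17.1.109/24 dev vlan20
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
connected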
(In reply to Daniel Alvarez Sanchez from comment #7)
> My take is:
>
> 1) The control plane should be put outside OVS so that a restart of OVS
> doesn't cause reconnections to the OVN DBs, which are costly. We've been
> trying to move away from this configuration for a long time.
> 2) The IP configuration is lost upon restart of OVS; I don't think this is
> caused by the newer OVS version, but can you please confirm? I'd say it has
> been like this before and OVS is working OK.

Actually, as pointed out by Dumitru, it could be an issue in OVS 2.15:
https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html
Taking BZ to figure this out ;)
Assigning it to Ilya, who has root-caused the issue in ovsdb-idl and will post a fix upstream for it soon.
The issue is that ovs-vswitchd starts configuring bridges while it has not yet connected to, or not yet received all the data from, ovsdb-server. So ovs-vswitchd thinks that there should be no bridges and ports and deletes them. After receiving the actual data, the ports and bridges are re-created, but the IPs and other information are already lost at this point.

This is a regression from the ovsdb-idl split that happened in 2.15.

Fix posted upstream for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/
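A quick way to observe the behaviour described above on an affected build (bridge and interface names are the ones from this deployment; adjust for yours):

[root@compute-1 ~]# ovs-vsctl list-ports br-isolated
[root@compute-1 ~]# systemctl restart openvswitch.service
[root@compute-1 ~]# ovs-vsctl list-ports br-isolated
[root@compute-1 ~]# ip a sh vlan20

The ports reappear once ovs-vswitchd has re-read the database, but vlan20 comes back DOWN with no IP, which is what breaks the ovn-controller connection.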
*** Bug 1967142 has been marked as a duplicate of this bug. ***
(In reply to Ilya Maximets from comment #15)
> The issue is that ovs-vswitchd starts configuring bridges while it has
> not yet connected to, or not yet received all the data from, ovsdb-server.
> So ovs-vswitchd thinks that there should be no bridges and ports and
> deletes them. After receiving the actual data, the ports and bridges are
> re-created, but the IPs and other information are already lost at this
> point.
>
> This is a regression from the ovsdb-idl split that happened in 2.15.
>
> Fix posted upstream for review:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

Can we get a new RPM so we can test the proposed fix?
(In reply to Eran Kuris from comment #17)
> Can we get a new RPM so we can test the proposed fix?

Sure. Here it is:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1968445.0.1.el8fdp/
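In case it helps with testing, one way to try the scratch build on a compute node is to download the openvswitch2.15 packages from the link above and upgrade them in place, then repeat the restart check (this sketch assumes the RPMs matching the host architecture have been downloaded to the current directory):

[root@compute-1 ~]# dnf upgrade ./openvswitch2.15-*.rpm
[root@compute-1 ~]# systemctl restart openvswitch.service
[root@compute-1 ~]# ip a sh vlan20
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status

With the fix applied, vlan20 should keep its IP across the restart and ovn-controller should report "connected".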
The issue was fixed according to the latest build run:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve/260/testReport/neutron_plugin.tests.scenario.test_multicast/MulticastTestIPv4Common/test_igmp_snooping_after_openvswitch_restart_id_d6730359_5d78_438c_ad70_5c8aadac6a1d_/

core_puddle: RHOS-16.2-RHEL-8-20210614.n.1
2021-06-14T11:53:47+0000 DEBUG Installed: openvswitch2.15-2.15.0-24.el8fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483