Bug 1968445
| Summary: | [OSP16.2]nova-compute service is down after openvswitch_restart | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> |
| Component: | openvswitch | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 16.2 (Train) | CC: | apevec, bfournie, ccamposr, chrisw, dalvarez, echaudro, egallen, fhallal, hbrock, i.maximets, jlibosva, jslagle, mburns, rsafrono, spower |
| Target Milestone: | beta | Keywords: | AutomationBlocker, Regression, Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhosp-openvswitch-2.15-4.el8ost.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1970832 (view as bug list) | Environment: | |
| Last Closed: | 2021-09-15 07:15:41 UTC | Type: | Bug |
| Bug Depends On: | 1970832 | | |
| Bug Blocks: | | | |
Was there any troubleshooting done to figure out what is causing this? Is openvswitch programmed correctly after the restart? As I do not have an OSP system, can you make yours available?

Checked Eran's setup, and it looks like after the OVS restart the undercloud network is not set up correctly, as OVN can't talk to the SB controller.
[root@compute-1 heat-admin]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected
[root@compute-1 heat-admin]# ovs-vsctl list open
_uuid : c659a74f-9893-4260-a873-fe0b5ae1d88d
bridges : [8863c7c3-ed20-4ef6-ae37-d97fefca4457, af1f0af3-8f54-4b7f-af73-75e8880b7b6d, ca80eca0-8ecb-43d0-80a4-f04fac6acad1]
cur_cfg : 533
datapath_types : [netdev, system]
datapaths : {}
db_version : "8.2.0"
dpdk_initialized : false
dpdk_version : "DPDK 20.11.0"
external_ids : {hostname=compute-1.redhat.local, ovn-bridge=br-int, ovn-bridge-mappings="datacentre:br-ex,tenant:br-isolated", ovn-encap-ip="172.17.2.15", ovn-encap-type=geneve, ovn-openflow-probe-interval="60", ovn-remote="tcp:172.17.1.122:6642", ovn-remote-probe-interval="60000", rundir="/var/run/openvswitch", system-id="277508c2-98bd-4c54-9576-aac113bda829"}
iface_types : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options : [016abb0c-b9dd-43ac-84d8-fef8b5a489b8]
next_cfg : 533
other_config : {}
ovs_version : "2.15.1"
ssl : []
statistics : {}
system_type : rhel
system_version : "8.4"
[root@compute-1 heat-admin]# ping 172.17.1.122
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
^C
--- 172.17.1.122 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3111ms
[root@compute-1 heat-admin]# ip r get 172.17.1.122
172.17.1.122 via 192.168.24.1 dev ens3 src 192.168.24.50 uid 0
cache
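The route lookup above resolves the SB address through the default gateway on ens3 rather than through a connected route on vlan20, which suggests the vlan20 interface has lost its 172.17.1.x address. A quick way to confirm this (a hedged sketch; the interface name and subnet are specific to this environment):

ip -br addr show vlan20         # when healthy, a 172.17.1.0/24 address should be listed here
ip route show 172.17.1.0/24     # the connected route disappears together with the address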
I think I know what's going on.
When the node is rebooted, things look good; vlan20 is in OVS (br-isolated) and has an IP configured on the SB database network:
Bridge br-isolated
fail_mode: standalone
Port vlan30
tag: 30
Interface vlan30
type: internal
Port vlan20
tag: 20
Interface vlan20
type: internal
[root@compute-1 ~]# ip a sh vlan20
9: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 2a:53:8f:15:e5:57 brd ff:ff:ff:ff:ff:ff
inet 172.17.1.109/24 brd 172.17.1.255 scope global vlan20
valid_lft forever preferred_lft forever
inet6 fe80::2853:8fff:fe15:e557/64 scope link
valid_lft forever preferred_lft forever
[root@compute-1 ~]# ovs-vsctl get open . external_ids:ovn-remote
"tcp:172.17.1.122:6642"
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=0.274 ms
--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.274/0.274/0.274/0.000 ms
Then when you restart OVS, the IP goes away:
[root@compute-1 ~]# sudo systemctl restart openvswitch.service
[root@compute-1 ~]# ip a sh vlan20
16: vlan20: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether fa:a1:d3:53:50:21 brd ff:ff:ff:ff:ff:ff
Adding the IP back to vlan20 makes it work:
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
not connected
[root@compute-1 ~]# ip l s dev vlan20 up
[root@compute-1 ~]# ip a a 172.17.1.109/24 dev vlan20
[root@compute-1 ~]# ping 172.17.1.122 -c1
PING 172.17.1.122 (172.17.1.122) 56(84) bytes of data.
64 bytes from 172.17.1.122: icmp_seq=1 ttl=64 time=1.25 ms
--- 172.17.1.122 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.248/1.248/1.248/0.000 ms
[root@compute-1 ~]# podman exec ovn_controller ovn-appctl -t ovn-controller connection-status
connected
My take is:
1) The control plane should be put outside OVS so that a restart of OVS doesn't cause reconnections to the OVN DBs, which are costly. We've been trying to move away from this configuration for a long time.
2) The IP configuration is lost upon restart of OVS; I don't think this is caused by the newer OVS version, but can you please confirm? I'd say it has been like this before and OVS is working OK.
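Regarding point 1, whether the control-plane VLAN really lives inside OVS can be seen from the bridge configuration itself; a minimal check, assuming the bridge and interface names used in this environment:

ovs-vsctl list-ports br-isolated      # vlan20 listed here means the SB database network rides on an OVS port
ovs-vsctl get Interface vlan20 type   # "internal" confirms it is an OVS-managed internal interface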
(In reply to Daniel Alvarez Sanchez from comment #7)
> My take is:
>
> 1) The control plane should be put outside OVS so that a restart of OVS
> doesn't cause reconnections to the OVN DBs, which are costly. We've been
> trying to move away from this configuration for a long time.
> 2) The IP configuration is lost upon restart of OVS; I don't think this is
> caused by the newer OVS version, but can you please confirm? I'd say it has
> been like this before and OVS is working OK.

Actually, as pointed out by Dumitru, it could be an issue in OVS 2.15:
https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html
Taking BZ to figure this out ;)

Assigning it to Ilya, who has root-caused the issue in ovsdb-idl and will post a fix upstream for it soon.

The issue is that ovs-vswitchd starts configuring bridges while it has not yet connected to, or not yet received all the data from, ovsdb-server. So ovs-vswitchd thinks that there should be no bridges and ports and deletes them. After receiving the actual data, ports and bridges are re-created, but IPs and other information are already lost at this point.

This is a regression from the ovsdb-idl split that happened in 2.15. Fix posted upstream for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

*** Bug 1967142 has been marked as a duplicate of this bug. ***

(In reply to Ilya Maximets from comment #15)
> The issue is that ovs-vswitchd starts configuring bridges while it has
> not yet connected to, or not yet received all the data from, ovsdb-server.
> So ovs-vswitchd thinks that there should be no bridges and ports and
> deletes them. After receiving the actual data, ports and bridges
> will be re-created, but IPs and other information are already lost at
> this point.
>
> This is a regression from the ovsdb-idl split that happened in 2.15.
>
> Fix posted upstream for review:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210608131723.2996019-1-i.maximets@ovn.org/

Can we get a new RPM so we can test the proposed fix?

(In reply to Eran Kuris from comment #17)
> Can we get a new RPM so we can test the proposed fix?

Sure. Here it is:
http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.15/2.15.0/23.bz1968445.0.1.el8fdp/

The issue was fixed according to the latest build run:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve/260/testReport/neutron_plugin.tests.scenario.test_multicast/MulticastTestIPv4Common/test_igmp_snooping_after_openvswitch_restart_id_d6730359_5d78_438c_ad70_5c8aadac6a1d_/

core_puddle: RHOS-16.2-RHEL-8-20210614.n.1
2021-06-14T11:53:47+0000 DEBUG Installed: openvswitch2.15-2.15.0-24.el8fdp.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483
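For anyone re-verifying the fix on an updated node, the check performed by the CI job can be approximated manually. This is only a sketch, assuming the compute-node layout shown earlier in this bug (vlan20 carrying the SB database network) and the fixed package version named above:

rpm -q openvswitch2.15                      # expect 2.15.0-24.el8fdp or newer
ip -br addr show vlan20                     # note the 172.17.1.x address
sudo systemctl restart openvswitch.service
ip -br addr show vlan20                     # with the fix, the address should survive the restart
podman exec ovn_controller ovn-appctl -t ovn-controller connection-status   # expect "connected"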
Description of problem:
After openvswitch2.15 is restarted, the nova-compute service is down; as a result, tests are failing with the error: "No valid host was found. There are not enough hosts available"

Version-Release number of selected component (if applicable):
core_puddle_version RHOS-16.2-RHEL-8-20210525.n.0
ovn-2021-21.03.0-21.el8fdp.x86_64
openvswitch2.15-2.15.0-22.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce (a condensed shell sketch follows below):
1. Install OSP 16.2 with the new OVS 2.15 and ovn-2021.
2. Run: openstack compute service list {all services are up}
3. Run: sudo systemctl restart openvswitch.service on a compute node.
4. Run: openstack compute service list {nova-compute service is down}

Actual results:

Expected results:

Additional info:
The workaround is to reboot the compute nodes.
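Steps 2-4 above can be condensed into a short shell sequence. The compute host name, the heat-admin user, and the 60-second settle time are assumptions based on this environment, not part of the original report:

# run from a host with OpenStack CLI credentials loaded
openstack compute service list                                        # all services up
ssh heat-admin@compute-1 'sudo systemctl restart openvswitch.service'
sleep 60                                                              # give nova-compute time to report its state
openstack compute service list                                        # with the affected OVS build, nova-compute shows as down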