This bug was initially created as a copy of Bug #2089416

I am copying this bug because: This copy is for errata purposes only. The original issue is fixed in RHEL 8, and this one tracks the fix for RHEL 9.

Description of problem:

During our last OpenStack update from 16.1 to 16.2, we encountered a network dataplane outage on instances at step 3.3 of the documentation [2]. It was detected by pinging multiple instances and lasted one to two minutes.

We found two OVN commits that seem relevant to this behaviour:

https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9
https://github.com/ovn-org/ovn/commit/d53c599ed05ea3c708a045a9434875458effa21e

We hope these patches will soon be backported into RHOSP OVN to avoid this issue in future upgrades.

This outage had a big impact on some of our clients, especially those running Kubernetes clusters: nodes were failing and pods were massively rescheduled, which also led to high CPU usage on compute nodes.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud
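For reference, the second commit above adds an ovn-controller knob, external_ids:ovn-ofctrl-wait-before-clear, which delays the flushing of installed OpenFlow flows after a restart so that the old flows keep forwarding traffic while the new controller recomputes its own. A minimal sketch of using it, assuming the option is available in the installed OVN build; the 7000 ms value is the one used in the verification run below, not a tuned recommendation:

```shell
# Keep existing OpenFlow flows in place for up to 7000 ms after
# ovn-controller (re)connects, instead of clearing them immediately.
# (Value taken from the verification log; tune for your environment.)
ovs-vsctl set open_vswitch . external_ids:ovn-ofctrl-wait-before-clear=7000

# Restart ovn-controller; the dataplane should keep forwarding
# during the flow-recomputation window.
systemctl restart ovn-controller
```

In a RHOSP deployment these commands would run inside the ovn_controller container on each node rather than directly on the host.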
Verified on ovn-2021-21.12.0-94.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
openvswitch2.17-2.17.0-50.el9fdp.x86_64
ovn-2021-21.12.0-94.el9fdp.x86_64
ovn-2021-central-21.12.0-94.el9fdp.x86_64
ovn-2021-host-21.12.0-94.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=25.4 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 25.384/25.384/25.384/0.000 ms

+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=8.13 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.133/8.133/8.133/0.000 ms

+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.038 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.036 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31090ms
rtt min/avg/max/mdev = 0.017/0.072/2.844/0.184 ms    <=== no downtime
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8569