This bug was initially created as a copy of Bug #2089416

I am copying this bug because: this copy is made for errata purposes. The original issue was reported against ovn-2021; this one tracks ovn22.03 on RHEL 8.

Description of problem:

During our last OpenStack update from 16.1 to 16.2, we encountered a network dataplane outage on instances at step 3.3 of the documentation [2]. It was detected by pinging multiple instances and lasted one to two minutes. We found two OVN commits that seem relevant to this behaviour:

https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9
https://github.com/ovn-org/ovn/commit/d53c599ed05ea3c708a045a9434875458effa21e

We hope these patches will soon be backported into RHOSP OVN to avoid this issue in future upgrades. This outage had a big impact on some of our clients, especially those running Kubernetes clusters: nodes were marked as failed and pods were massively rescheduled, which also led to high CPU usage on compute nodes.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud
Tested with the following script:

systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 \
    external_ids:ovn-remote=tcp:20.0.10.25:6642 \
    external_ids:ovn-encap-type=geneve \
    external_ids:ovn-encap-ip=20.0.10.25
systemctl restart ovn-controller

ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest

i=1
for m in `seq 0 4`; do
  for n in `seq 1 99`; do
    ovn-nbctl lr-add r${i}
    ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
    ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
    ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
    ovn-nbctl set logical_router r${i} options:chassis=hv1
    # switch s$i
    ovn-nbctl ls-add s${i}
    # s$i - r$i
    ovn-nbctl lsp-add s${i} s${i}_r${i}
    ovn-nbctl lsp-set-type s${i}_r${i} router
    ovn-nbctl lsp-set-addresses s${i}_r${i} router
    ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}
    # s$i - vm$i
    ovn-nbctl lsp-add s$i vm$i
    ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"
    ovn-nbctl lsp-add public public_r${i}
    ovn-nbctl lsp-set-type public_r${i} router
    ovn-nbctl lsp-set-addresses public_r${i} router
    ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public
    let i++
    if [ $i -gt 300 ]; then break; fi
  done
  if [ $i -gt 300 ]; then break; fi
done

# add host vm1
ip netns add vm1
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 00:de:ad:01:00:01
ip netns exec vm1 ip addr add 173.0.1.2/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1

# add host vm2
ip netns add vm2
ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 00:de:ad:01:00:02
ip netns exec vm2 ip addr add 173.0.2.2/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ovs-vsctl set Interface vm2 external_ids:iface-id=vm2

# set up the provider network
ovs-vsctl add-br nat_test
ip link set nat_test up
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=nattest:nat_test
ip netns add vm0
ovs-vsctl add-port nat_test vm0 -- set interface vm0 type=internal
ip link set vm0 netns vm0
ip netns exec vm0 ip link set vm0 address 00:00:00:00:00:01
ip netns exec vm0 ip addr add 172.16.0.100/16 dev vm0
ip netns exec vm0 ip link set vm0 up
ovs-vsctl set Interface vm0 external_ids:iface-id=vm0

ip netns exec vm1 ip route add default via 173.0.1.1
ip netns exec vm2 ip route add default via 173.0.2.1

ovn-nbctl --wait=hv sync
sleep 30
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1

ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1 &> ping.log &
systemctl restart ovn-controller
wait
tail ping.log

Reproduced on ovn22.03-22.03.0-69.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
openvswitch2.17-2.17.0-50.el9fdp.x86_64
ovn22.03-22.03.0-69.el9fdp.x86_64
ovn22.03-central-22.03.0-69.el9fdp.x86_64
ovn22.03-host-22.03.0-69.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=20.8 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 20.781/20.781/20.781/0.000 ms

+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=13.0 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 13.014/13.014/13.014/0.000 ms

+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.034 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.016 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.034 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.039 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.034 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 144 received, 52% packet loss, time 31096ms
rtt min/avg/max/mdev = 0.016/0.085/3.382/0.348 ms    <=== packet loss

64 bytes from 172.16.0.102: icmp_seq=32 ttl=62 time=0.076 ms
64 bytes from 172.16.0.102: icmp_seq=33 ttl=62 time=0.065 ms
64 bytes from 172.16.0.102: icmp_seq=190 ttl=62 time=2.52 ms    <=== about 16 s of downtime after restarting ovn-controller
64 bytes from 172.16.0.102: icmp_seq=191 ttl=62 time=0.285 ms

Verified on ovn22.03-22.03.0-118.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"
openvswitch2.17-2.17.0-50.el9fdp.x86_64
ovn22.03-22.03.0-118.el9fdp.x86_64
ovn22.03-central-22.03.0-118.el9fdp.x86_64
ovn22.03-host-22.03.0-118.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=26.3 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 26.290/26.290/26.290/0.000 ms

+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=14.5 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 14.485/14.485/14.485/0.000 ms

+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.017 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.038 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.036 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31095ms
rtt min/avg/max/mdev = 0.017/0.068/1.796/0.147 ms    <=== no packet loss, i.e. no downtime after restarting ovn-controller
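The downtime figures quoted above (e.g. roughly 16 s between icmp_seq 33 and 190 at a 0.1 s ping interval) can be derived from ping.log with a short awk sketch; this helper is not part of the original reproducer, just a hypothetical convenience:

```shell
# Find gaps in the icmp_seq numbers recorded in ping.log and estimate
# the corresponding downtime, assuming the 0.1 s interval used above.
awk -F'icmp_seq=' '/icmp_seq=/ {
    split($2, a, " "); seq = a[1] + 0
    if (prev && seq > prev + 1)
        printf "gap: seq %d -> %d (~%.1f s down)\n", prev, seq, (seq - prev - 1) * 0.1
    prev = seq
}' ping.log
```

On the reproduced run this reports a single large gap; on the verified builds it prints nothing.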
Verified on ovn22.03-22.03.0-118.el8:

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=33.1 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 33.085/33.085/33.085/0.000 ms

+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=4.42 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.419/4.419/4.419/0.000 ms

+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.024 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.024 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.023 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31091ms
rtt min/avg/max/mdev = 0.023/0.061/1.842/0.143 ms

[root@dell-per750-18 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"
ovn22.03-central-22.03.0-118.el8fdp.x86_64
ovn22.03-22.03.0-118.el8fdp.x86_64
ovn22.03-host-22.03.0-118.el8fdp.x86_64
openvswitch2.17-2.17.0-61.el8fdp.x86_64
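For reference, the ovn-ofctrl-wait-before-clear knob used in the reproducer is read by ovn-controller from the local Open_vSwitch table and is expressed in milliseconds (7000 above, i.e. 7 s). A sketch, assuming standard ovs-vsctl database commands, of how to inspect the value and remove it again after testing:

```shell
# Show the currently configured delay (ms) before a restarted
# ovn-controller clears the existing OpenFlow flows on br-int.
ovs-vsctl get open . external_ids:ovn-ofctrl-wait-before-clear

# Drop the key again once testing is done.
ovs-vsctl remove open . external_ids ovn-ofctrl-wait-before-clear
```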
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn22.03 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8570