Bug 2139425 - data plane downtime during the first flow installation.
Summary: data plane downtime during the first flow installation.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn22.03
Version: FDP 22.D
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-11-02 13:32 UTC by Mark Michelson
Modified: 2022-11-21 18:40 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-21 18:40:26 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker FD-2424 (last updated 2022-11-02 13:37:02 UTC)
Red Hat Product Errata RHBA-2022:8570 (last updated 2022-11-21 18:40:27 UTC)

Description Mark Michelson 2022-11-02 13:32:54 UTC
This bug was initially created as a copy of Bug #2089416

I am copying this bug because:
This copy is made for errata purposes. The original issue was reported against ovn-2021; this copy tracks ovn22.03 on RHEL 8.


Description of problem:
During our last OpenStack update from 16.1 to 16.2, we encountered a network data plane outage on instances at step 3.3 of the documented procedure [2]. It was detected by pinging multiple instances and lasted one to two minutes.
We found two OVN commits that seem relevant to this behaviour:

    https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9

    https://github.com/ovn-org/ovn/commit/d53c599ed05ea3c708a045a9434875458effa21e

We hope these patches will soon be backported into RHOSP OVN so that future upgrades avoid this issue.

This outage had a significant impact on some of our clients, especially those running Kubernetes clusters: nodes were reported as failed and pods were massively rescheduled, which also led to high CPU usage on the compute nodes.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud
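
The reproducer in comment 3 below exercises the ovn-ofctrl-wait-before-clear setting, which tells ovn-controller to wait the given number of milliseconds after (re)connecting to OVS before it clears the pre-existing OpenFlow flows, so the old flows can keep forwarding traffic while the new flow table is computed. A minimal sketch of setting it (the value is illustrative, taken from the reproducer; this reduces restart downtime but is not a substitute for the fix):

# Illustrative: wait 7000 ms after ovn-controller (re)connects before
# clearing the existing OpenFlow flows.
ovs-vsctl set Open_vSwitch . external_ids:ovn-ofctrl-wait-before-clear=7000
# Confirm the setting:
ovs-vsctl get Open_vSwitch . external_ids:ovn-ofctrl-wait-before-clear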

Comment 3 Jianlin Shi 2022-11-03 02:30:00 UTC
Tested with the following script:

# start OVS/OVN and point ovn-controller at the local databases
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.10.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.10.25
systemctl restart ovn-controller

# public logical switch with a localnet port to the provider network
ovn-nbctl ls-add public
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest
                                                                                                      
# create 300 logical routers, each with its own switch, one VM port on that
# switch, a dnat_and_snat entry, and a port on the public switch
i=1
for m in `seq 0 4`;do
  for n in `seq 1 99`;do
    ovn-nbctl lr-add r${i}
    ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
    ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
    ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
    ovn-nbctl set logical_router r${i} options:chassis=hv1

    # switch s$i
    ovn-nbctl ls-add s${i}

    # s$i - r$i
    ovn-nbctl lsp-add s${i} s${i}_r${i}
    ovn-nbctl lsp-set-type s${i}_r${i} router
    ovn-nbctl lsp-set-addresses s${i}_r${i} router
    ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}

    # s$i - vm$i
    ovn-nbctl lsp-add s$i vm$i
    ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"

    # public - r$i
    ovn-nbctl lsp-add public public_r${i}
    ovn-nbctl lsp-set-type public_r${i} router
    ovn-nbctl lsp-set-addresses public_r${i} router
    ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public

    let i++
    if [ $i -gt 300 ];then
      break
    fi
  done
  if [ $i -gt 300 ];then
    break
  fi
done
# add host vm1
ip netns add vm1
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
ip link set vm1 netns vm1
ip netns exec vm1 ip link set vm1 address 00:de:ad:01:00:01
ip netns exec vm1 ip addr add 173.0.1.2/24 dev vm1
ip netns exec vm1 ip link set vm1 up
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1
                 
ip netns add vm2
ovs-vsctl add-port br-int vm2 -- set interface vm2 type=internal
ip link set vm2 netns vm2
ip netns exec vm2 ip link set vm2 address 00:de:ad:01:00:02
ip netns exec vm2 ip addr add 173.0.2.2/24 dev vm2
ip netns exec vm2 ip link set vm2 up
ovs-vsctl set Interface vm2 external_ids:iface-id=vm2
                 
# set up the provider network
ovs-vsctl add-br nat_test
ip link set nat_test up
ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=nattest:nat_test

ip netns add vm0
ovs-vsctl add-port nat_test vm0 -- set interface vm0 type=internal
ip link set vm0 netns vm0
ip netns exec vm0 ip link set vm0 address 00:00:00:00:00:01
ip netns exec vm0 ip addr add 172.16.0.100/16 dev vm0
ip netns exec vm0 ip link set vm0 up
ovs-vsctl set Interface vm0 external_ids:iface-id=vm0
ip netns exec vm1 ip route add default via 173.0.1.1
ip netns exec vm2 ip route add default via 173.0.2.1

# wait until the hypervisor has caught up, then baseline connectivity checks
ovn-nbctl --wait=hv sync
sleep 30
ip netns exec vm1 ping 172.16.0.102 -c 1
ip netns exec vm1 ping 172.16.0.100 -c 1
# let ovn-controller wait 7000 ms after restart before clearing existing flows
ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
# 300 pings at 0.1 s intervals (~30 s) spanning the ovn-controller restart
ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1 &> ping.log &
systemctl restart ovn-controller
wait
tail ping.log
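
To quantify the downtime from ping.log, a small sketch (assuming the standard ping output shown below: sequential icmp_seq numbers at a 0.1 s interval, so each missing sequence number is roughly 0.1 s of outage):

# Report gaps in icmp_seq; each missing seq is ~0.1 s (ping runs with -i 0.1).
awk -F'icmp_seq=' '/icmp_seq=/ {
    split($2, a, " "); seq = a[1]
    if (prev != "" && seq > prev + 1)
        printf "gap: seq %d-%d lost (~%.1f s)\n", prev + 1, seq - 1, (seq - 1 - prev) * 0.1
    prev = seq
}' ping.log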


Reproduced on ovn22.03-22.03.0-69.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "ovn22.03|openvswitch2.17"
openvswitch2.17-2.17.0-50.el9fdp.x86_64                                                               
ovn22.03-22.03.0-69.el9fdp.x86_64                                                                     
ovn22.03-central-22.03.0-69.el9fdp.x86_64
ovn22.03-host-22.03.0-69.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=20.8 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 20.781/20.781/20.781/0.000 ms
+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=13.0 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 13.014/13.014/13.014/0.000 ms
+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.034 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.016 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.034 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.039 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.034 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 144 received, 52% packet loss, time 31096ms
rtt min/avg/max/mdev = 0.016/0.085/3.382/0.348 ms

<=== packet loss

64 bytes from 172.16.0.102: icmp_seq=32 ttl=62 time=0.076 ms
64 bytes from 172.16.0.102: icmp_seq=33 ttl=62 time=0.065 ms
64 bytes from 172.16.0.102: icmp_seq=190 ttl=62 time=2.52 ms

<=== about 16 s of downtime after restarting ovn-controller (seq 34 through 189 lost; 156 packets at a 0.1 s interval is ~15.6 s)
64 bytes from 172.16.0.102: icmp_seq=191 ttl=62 time=0.285 ms

Verified on ovn22.03-22.03.0-118.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"                         
openvswitch2.17-2.17.0-50.el9fdp.x86_64
ovn22.03-22.03.0-118.el9fdp.x86_64
ovn22.03-central-22.03.0-118.el9fdp.x86_64
ovn22.03-host-22.03.0-118.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=26.3 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 26.290/26.290/26.290/0.000 ms
+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=14.5 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 14.485/14.485/14.485/0.000 ms
+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.017 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.038 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.036 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31095ms
rtt min/avg/max/mdev = 0.017/0.068/1.796/0.147 ms

<=== no packet loss, i.e. no downtime after restarting ovn-controller

Comment 4 Jianlin Shi 2022-11-03 05:44:00 UTC
Verified on ovn22.03-22.03.0-118.el8:

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=33.1 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 33.085/33.085/33.085/0.000 ms
+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=4.42 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.419/4.419/4.419/0.000 ms
+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.024 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.024 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.023 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.023 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31091ms
rtt min/avg/max/mdev = 0.023/0.061/1.842/0.143 ms
[root@dell-per750-18 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn22.03"
ovn22.03-central-22.03.0-118.el8fdp.x86_64
ovn22.03-22.03.0-118.el8fdp.x86_64
ovn22.03-host-22.03.0-118.el8fdp.x86_64
openvswitch2.17-2.17.0-61.el8fdp.x86_64

Comment 6 errata-xmlrpc 2022-11-21 18:40:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn22.03 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8570

