
Bug 2139415

Summary: data plane downtime during the first flow installation.
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn-2021
Version: FDP 22.K
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Mark Michelson <mmichels>
Assignee: OVN Team <ovnteam>
QA Contact: Jianlin Shi <jishi>
CC: ctrautma, jiji
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-11-21 18:21:17 UTC

Description Mark Michelson 2022-11-02 13:13:10 UTC
This bug was initially created as a copy of Bug #2089416

I am copying this bug because: 
This copy is for errata purposes only. The original issue is fixed in RHEL 8; this bug tracks the fix for RHEL 9.


Description of problem:
During our last OpenStack update from 16.1 to 16.2, we encountered a network dataplane outage on instances at step 3.3 of the documentation [2]. The outage was detected with pings to multiple instances and lasted one to two minutes.
We found two OVN commits that seem relevant to this behaviour:

    https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9

    https://github.com/ovn-org/ovn/commit/d53c599ed05ea3c708a045a9434875458effa21e

We hope these patches will soon be backported into RHOSP OVN so that this issue is avoided in future upgrades.

This outage had a significant impact on some of our clients, especially those running Kubernetes clusters: nodes were reported as failed and pods were rescheduled en masse, which also drove high CPU usage on the compute nodes.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud
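
For operators who hit this window before the backport lands, the verification in comment 3 below exercises the ovn-ofctrl-wait-before-clear knob that these patches introduce. A minimal sketch, assuming the fixed ovn-controller honors the option (the 7000 ms value simply matches the verification below, not a tuned recommendation):

    # Tell ovn-controller to delay clearing the existing OpenFlow flows
    # on restart, so traffic keeps flowing on the old flows while the
    # new ones are computed and installed.
    ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
    systemctl restart ovn-controller
    # Ping an instance in parallel to confirm no packets are lost.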

Comment 3 Jianlin Shi 2022-11-03 05:36:19 UTC
Verified on ovn-2021-21.12.0-94.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
openvswitch2.17-2.17.0-50.el9fdp.x86_64
ovn-2021-21.12.0-94.el9fdp.x86_64
ovn-2021-central-21.12.0-94.el9fdp.x86_64
ovn-2021-host-21.12.0-94.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=25.4 ms

--- 172.16.0.102 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 25.384/25.384/25.384/0.000 ms
+ ip netns exec vm1 ping 172.16.0.100 -c 1
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=8.13 ms

--- 172.16.0.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.133/8.133/8.133/0.000 ms
+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000
+ systemctl restart ovn-controller
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1
+ wait
+ tail ping.log
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.035 ms
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.036 ms
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.038 ms
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.036 ms

--- 172.16.0.102 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 31090ms
rtt min/avg/max/mdev = 0.017/0.072/2.844/0.184 ms

<=== no downtime
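
For anyone re-running this check, the loss figure can be pulled straight out of the ping log; a small sketch, assuming the same ping.log file name used above:

    # Print the summary line, e.g.
    # "300 packets transmitted, 300 received, 0% packet loss, time 31090ms"
    grep 'packet loss' ping.log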

Comment 5 errata-xmlrpc 2022-11-21 18:21:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8569