Bug 2139415 - data plane downtime during the first flow installation.
Summary: data plane downtime during the first flow installation.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn-2021
Version: FDP 22.K
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-11-02 13:13 UTC by Mark Michelson
Modified: 2022-11-21 18:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-21 18:21:17 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-2423 0 None None None 2022-11-02 13:19:47 UTC
Red Hat Product Errata RHBA-2022:8569 0 None None None 2022-11-21 18:21:18 UTC

Description Mark Michelson 2022-11-02 13:13:10 UTC
This bug was initially created as a copy of Bug #2089416

I am copying this bug because: 
This copy is for errata purposes only. The original issue is fixed in RHEL 8, and this one tracks the fix for RHEL 9.


Description of problem:
During our last OpenStack update from 16.1 to 16.2, we encountered a network data plane outage on instances at step 3.3 of the documentation [2]. It was detected by pinging multiple instances and lasted 1 to 2 minutes.
We found two OVN commits that seem relevant to this behaviour:

    https://github.com/ovn-org/ovn/commit/896adfd2d8b3369110e9618bd190d190105372a9

    https://github.com/ovn-org/ovn/commit/d53c599ed05ea3c708a045a9434875458effa21e

We hope these patches will soon be backported to RHOSP OVN to avoid this issue in future upgrades.

This outage had a big impact on some of our clients, especially those running Kubernetes clusters: nodes failed and pods were rescheduled en masse, which also led to high CPU usage on compute nodes.

[2] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/keeping_red_hat_openstack_platform_updated/index#proc_updating-ovn-controller-container_updating-overcloud

Comment 3 Jianlin Shi 2022-11-03 05:36:19 UTC
Verified on ovn-2021-21.12.0-94.el9:

[root@dell-per730-20 bz2139425]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"                         
openvswitch2.17-2.17.0-50.el9fdp.x86_64                                                               
ovn-2021-21.12.0-94.el9fdp.x86_64                                                                     
ovn-2021-central-21.12.0-94.el9fdp.x86_64                                                             
ovn-2021-host-21.12.0-94.el9fdp.x86_64

+ ip netns exec vm1 ping 172.16.0.102 -c 1                                                            
PING 172.16.0.102 (172.16.0.102) 56(84) bytes of data.                                                
64 bytes from 172.16.0.102: icmp_seq=1 ttl=62 time=25.4 ms                                            
                                                                                                      
--- 172.16.0.102 ping statistics ---                                                                  
1 packets transmitted, 1 received, 0% packet loss, time 0ms                                           
rtt min/avg/max/mdev = 25.384/25.384/25.384/0.000 ms                                                  
+ ip netns exec vm1 ping 172.16.0.100 -c 1                                                            
PING 172.16.0.100 (172.16.0.100) 56(84) bytes of data.                                                
64 bytes from 172.16.0.100: icmp_seq=1 ttl=63 time=8.13 ms                                            
                                                                                                      
--- 172.16.0.100 ping statistics ---                                                                  
1 packets transmitted, 1 received, 0% packet loss, time 0ms                                           
rtt min/avg/max/mdev = 8.133/8.133/8.133/0.000 ms                                                     
+ ovs-vsctl set open . external_ids:ovn-ofctrl-wait-before-clear=7000                                 
+ systemctl restart ovn-controller                                                                    
+ ip netns exec vm1 ping 172.16.0.102 -c 300 -i 0.1                                                   
+ wait                                                                                                
+ tail ping.log                                                                                       
64 bytes from 172.16.0.102: icmp_seq=295 ttl=62 time=0.036 ms                                         
64 bytes from 172.16.0.102: icmp_seq=296 ttl=62 time=0.035 ms                                         
64 bytes from 172.16.0.102: icmp_seq=297 ttl=62 time=0.035 ms                                         
64 bytes from 172.16.0.102: icmp_seq=298 ttl=62 time=0.036 ms                                         
64 bytes from 172.16.0.102: icmp_seq=299 ttl=62 time=0.038 ms                                         
64 bytes from 172.16.0.102: icmp_seq=300 ttl=62 time=0.036 ms                                         
                                                                                                      
--- 172.16.0.102 ping statistics ---                                                                  
300 packets transmitted, 300 received, 0% packet loss, time 31090ms                                   
rtt min/avg/max/mdev = 0.017/0.072/2.844/0.184 ms

<=== no downtime
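
The pass/fail check in the transcript above (0% packet loss across the ovn-controller restart) could be scripted rather than read by eye. The following is a minimal sketch, assuming iputils-style ping output like the summary line saved in ping.log above. The function names ping_loss_percent and ping_lost_count are hypothetical helpers for illustration and are not part of OVN or Open vSwitch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OVN/OVS): read ping output on stdin
# and extract figures from the iputils summary line, which looks like:
#   300 packets transmitted, 300 received, 0% packet loss, time 31090ms

# Print the packet-loss percentage (without the % sign).
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Print the number of lost packets (transmitted minus received).
ping_lost_count() {
    awk '/packets transmitted/ { print $1 - $4 }'
}
```

With these defined, `ping_lost_count < ping.log` prints the number of lost echo requests. Multiplying that count by the ping interval (0.1 s in the run above) gives a rough estimate of the downtime, assuming the losses were contiguous.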

Comment 5 errata-xmlrpc 2022-11-21 18:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8569

