Bug 2163559

Summary: ovn-controller coredumped on all nodes (controllers+compute) and many FIP flows were affected
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: David Hill <dhill>
Component: ovn-2021
Assignee: Ales Musil <amusil>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: urgent
Docs Contact:
Priority: urgent
Version: FDP 20.F
CC: amusil, ctrautma, dceara, gurpsing, jiji, ltamagno, mmichels
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ovn-2021-21.12.0-134.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-06 20:05:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2023-01-23 21:58:14 UTC
Description of problem:
ovn-controller coredumped on all nodes (controllers+compute) and many FIP flows were affected

Dec 23 11:28:55 overcloud-controller-0 systemd-coredump[720680]: Process 558845 (ovn-controller) of user 0 dumped core.#012#012Stack trace of thread 7:#012#0  0x0000561dfcd3d271 n/a (/usr/bin/ovn-controller)#012#1  0x0000000000000000 n/a (n/a)#012#2  0x0000000000000000 n/a (n/a)
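
The `#012` sequences in the journal line above look like syslog-escaped newlines (octal 012 is the line feed character). Assuming that is the case, a small filter along these lines makes the stack trace easier to read. The function name `decode_syslog_newlines` is hypothetical and not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helper: turn the literal "#012" sequences that syslog
# writes for embedded newlines back into real line breaks.
decode_syslog_newlines() {
    awk '{ gsub(/#012/, "\n"); print }'
}

# Example: feed part of the escaped journal line through the filter.
printf '%s\n' 'dumped core.#012#012Stack trace of thread 7:#012#0  0x0000561dfcd3d271 n/a (/usr/bin/ovn-controller)' \
    | decode_syslog_newlines
```

The filter only changes how the line is displayed; the frames still carry no symbol names, so it does not add any debugging information.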


Version-Release number of selected component (if applicable):


How reproducible:
Twice so far

Steps to Reproduce:
1. No clear reproduction steps are known.

Actual results:

Expected results:
No core dumps and no affected flows

Additional info:
In the OVS logs we see this prior to the crash:
2022-12-23T10:23:55.239Z|09912|connmgr|INFO|br-int<->unix#1: 402 flow_mods 10 s ago (402 adds)
2022-12-23T10:24:55.239Z|09913|connmgr|INFO|br-int<->unix#1: 960 flow_mods in the last 52 s (957 adds, 3 deletes)
2022-12-23T10:25:55.239Z|09914|connmgr|INFO|br-int<->unix#1: 388 flow_mods in the 18 s starting 50 s ago (385 adds, 3 deletes)
2022-12-23T10:27:41.955Z|09915|connmgr|INFO|br-int<->unix#1: 714 flow_mods in the 7 s starting 10 s ago (348 adds, 366 deletes)
2022-12-23T10:28:05.852Z|09916|connmgr|INFO|br-int<->unix#1: 1748 flow_mods in the 22 s starting 23 s ago (339 adds, 1409 deletes)
2022-12-23T10:29:26.671Z|00001|timeval(handler48)|WARN|Unreasonably long 1565ms poll interval (0ms user, 2ms system)
2022-12-23T10:29:26.672Z|00002|timeval(handler48)|WARN|faults: 1 minor, 0 major
2022-12-23T10:29:26.672Z|00003|timeval(handler48)|WARN|context switches: 0 voluntary, 1 involuntary

Comment 4 Mark Michelson 2023-03-27 15:40:42 UTC
It appears the corresponding customer case has been closed. We also suspect that this core dump might be fixed by backporting commit 2e4f393650ccf298f26787583c13a88197ba8348 from OVN main (https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348). Once we backport this fix, we will close this issue.

Comment 5 David Hill 2023-05-04 18:31:13 UTC
We didn't have the core dump and it didn't happen again ...

Comment 9 OVN Bot 2023-06-08 17:45:20 UTC
ovn-2021 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2213611

Comment 13 Jianlin Shi 2023-06-25 01:39:34 UTC
Hi Ales,

This is the changelog entry I found in ovn-2021.spec:
* Thu May 18 2023 Ales Musil <amusil> - 21.12.0-132
- [branch-21.12] ovs: Bump submodule to v2.17.6 (#2163559)
[Upstream: 52ef956bb4e9de2d418805dd43b337184f1aa560]

Which specific patch fixes the issue? Is there a reproducer for it? Thanks.
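
As a side note on the changelog entry quoted above, one way to check whether an installed build carries an entry referencing this bug is to search the package changelog for the `(#NNNNNNN)` tag. This is only a sketch: the helper name `changelog_mentions_bug` is hypothetical, and a match shows that the bug is referenced, not which upstream commit fixes it.

```shell
#!/bin/sh
# Hypothetical helper: succeed if changelog text on stdin references the
# given Bugzilla ID in the "(#NNNNNNN)" form used by the entry above.
changelog_mentions_bug() {
    grep -q -F "(#$1)"
}

# Assumed usage on a host with the package installed:
#   rpm -q --changelog ovn-2021 | changelog_mentions_bug 2163559 && echo "referenced"
```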

Comment 14 Ales Musil 2023-07-03 05:53:04 UTC
Hi,

The commit https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348
includes a test that was used to reproduce the original issue. Basing a reproducer on that test is the best option.

Thanks,
Ales

Comment 15 Jianlin Shi 2023-07-03 07:11:33 UTC
Tested with the following script:

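# Enable core dump collection through systemd-coredump and clear old dumps and journals.
enable_coredump()
{
        # Remove the core file and stack size limits for this shell.
        ulimit -c unlimited
        ulimit -s unlimited
        # Allow core dumps from privileged (setuid) processes.
        sysctl -w fs.suid_dumpable=2
        # Route core dumps to systemd-coredump unless that is already configured.
        if ! sysctl kernel.core_pattern | grep systemd-coredump
        then
                sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c "
        fi
        # Start from a clean state: delete earlier core dumps and journal files.
        rm -rf /var/lib/systemd/coredump/*
        rm -rf /run/log/journal/*
        rm -rf /var/log/journal/*
        systemctl restart systemd-journald
}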

enable_coredump


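# Start OVS and the OVN central services.
systemctl start openvswitch
systemctl start ovn-northd

# Expose the NB and SB databases over TCP.
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642

# Register this host as chassis hv1 and point ovn-controller at the SB database.
# NOTE: the next line is cut off after "external_1" in the original paste.
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_1
systemctl restart ovn-controller

# Create an internal port and bind it to logical port sw0-port1.
ovs-vsctl add-port br-int p1 -- set interface p1 type=internal
ovs-vsctl set interface p1 external-ids:iface-id=sw0-port1
ovn-nbctl --wait=hv sync

# Pause ovn-controller, then create and delete the logical port while it is paused.
ovn-appctl debug/pause
sleep 2
ovn-appctl -t ovn-controller debug/status
ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1
ovn-nbctl lsp-del sw0-port1
ovn-nbctl --wait=sb sync

# Resume processing, then delete the logical switch.
ovn-appctl debug/resume
ovn-nbctl --wait=hv sync

ovn-nbctl ls-del sw0
ovn-nbctl --wait=hv sync

# Check whether ovn-controller dumped core.
coredumpctl list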
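
The final `coredumpctl list` step is read by eye in this test. As a rough sketch, the same check could be automated by counting the rows that name the executable. The helper name `count_coredumps_for` is hypothetical; it reads `coredumpctl list` output from stdin and matches the path in any column, since the column layout may differ between systemd versions.

```shell
#!/bin/sh
# Hypothetical helper: count rows of `coredumpctl list` output (read from
# stdin) that name the given executable in any column.
count_coredumps_for() {
    awk -v exe="$1" '
        { for (i = 1; i <= NF; i++) if ($i == exe) { n++; break } }
        END { print n + 0 }
    '
}

# Assumed usage at the end of the reproducer:
#   coredumpctl list 2>&1 | count_coredumps_for /usr/bin/ovn-controller
```

A result of 0 corresponds to the "No coredumps found." case; any other value means at least one core dump of that executable was recorded.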

Reproduced on ovn-2021-21.12.0-130.el8:

+ ovn-nbctl --wait=hv sync                                                                            
+ coredumpctl list                                                                                    
TIME                            PID   UID   GID SIG COREFILE  EXE                                     
Mon 2023-07-03 03:07:01 EDT   34521   993   990  11 none      /usr/bin/ovn-controller   

<=== coredump

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
ovn-2021-21.12.0-130.el8fdp.x86_64
ovn-2021-host-21.12.0-130.el8fdp.x86_64
ovn-2021-central-21.12.0-130.el8fdp.x86_64
openvswitch2.17-2.17.0-106.el8fdp.x86_64

Verified on ovn-2021-21.12.0-134.el8:

+ ovn-nbctl --wait=hv sync
+ coredumpctl list
No coredumps found.

<=== no coredump

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"                             
ovn-2021-21.12.0-134.el8fdp.x86_64                                                                    
ovn-2021-host-21.12.0-134.el8fdp.x86_64                                                               
openvswitch2.17-2.17.0-106.el8fdp.x86_64                                                              
ovn-2021-central-21.12.0-134.el8fdp.x86_64
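
The package listings above are compared by eye against the "Fixed In Version" build. A minimal sketch of doing that comparison in a script follows. The helper name `release_at_least` is hypothetical, and it relies on `sort -V` from GNU coreutils, which only approximates RPM's own version ordering.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version-release $1 is at least $2.
# Relies on `sort -V` (GNU coreutils), which only approximates RPM's
# own version comparison.
release_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed usage on a host with the package installed:
#   release_at_least "$(rpm -q --qf '%{VERSION}-%{RELEASE}' ovn-2021)" 21.12.0-134.el8fdp
release_at_least 21.12.0-134.el8fdp 21.12.0-134.el8fdp && echo "fixed build"
release_at_least 21.12.0-130.el8fdp 21.12.0-134.el8fdp || echo "older than the fix"
```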

Comment 17 errata-xmlrpc 2023-07-06 20:05:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3995