Bug 2163559 - ovn-controller coredumped on all nodes (controllers+compute) and many FIP flows were affected
Summary: ovn-controller coredumped on all nodes (controllers+compute) and many FIP flows were affected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn-2021
Version: FDP 20.F
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Ales Musil
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-23 21:58 UTC by David Hill
Modified: 2023-07-06 20:05 UTC
CC List: 7 users

Fixed In Version: ovn-2021-21.12.0-134.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-06 20:05:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                   Last Updated
Red Hat Issue Tracker FD-2648               2023-01-23 22:01:01 UTC
Red Hat Product Errata RHBA-2023:3995       2023-07-06 20:05:54 UTC

Description David Hill 2023-01-23 21:58:14 UTC
Description of problem:
ovn-controller coredumped on all nodes (controllers+compute) and many FIP flows were affected

Dec 23 11:28:55 overcloud-controller-0 systemd-coredump[720680]: Process 558845 (ovn-controller) of user 0 dumped core.
Stack trace of thread 7:
#0  0x0000561dfcd3d271 n/a (/usr/bin/ovn-controller)
#1  0x0000000000000000 n/a (n/a)
#2  0x0000000000000000 n/a (n/a)


Version-Release number of selected component (if applicable):


How reproducible:
Twice so far

Steps to Reproduce:
1. Unknown; no clear reproduction steps identified so far

Actual results:

Expected results:
No core dumps and no affected flows

Additional info:
In the OVS logs we see this prior to the crash:
2022-12-23T10:23:55.239Z|09912|connmgr|INFO|br-int<->unix#1: 402 flow_mods 10 s ago (402 adds)
2022-12-23T10:24:55.239Z|09913|connmgr|INFO|br-int<->unix#1: 960 flow_mods in the last 52 s (957 adds, 3 deletes)
2022-12-23T10:25:55.239Z|09914|connmgr|INFO|br-int<->unix#1: 388 flow_mods in the 18 s starting 50 s ago (385 adds, 3 deletes)
2022-12-23T10:27:41.955Z|09915|connmgr|INFO|br-int<->unix#1: 714 flow_mods in the 7 s starting 10 s ago (348 adds, 366 deletes)
2022-12-23T10:28:05.852Z|09916|connmgr|INFO|br-int<->unix#1: 1748 flow_mods in the 22 s starting 23 s ago (339 adds, 1409 deletes)
2022-12-23T10:29:26.671Z|00001|timeval(handler48)|WARN|Unreasonably long 1565ms poll interval (0ms user, 2ms system)
2022-12-23T10:29:26.672Z|00002|timeval(handler48)|WARN|faults: 1 minor, 0 major
2022-12-23T10:29:26.672Z|00003|timeval(handler48)|WARN|context switches: 0 voluntary, 1 involuntary
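
For reference, a typical way to get a symbolized backtrace out of such a systemd-coredump entry (a sketch; it assumes the matching debuginfo repositories are enabled on the node) is roughly:

coredumpctl info ovn-controller                        # confirm the PID, signal and time of the crash
dnf debuginfo-install ovn-2021-host openvswitch2.17    # exact debuginfo package names depend on the installed builds
coredumpctl gdb ovn-controller                         # then: bt full, thread apply all bt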

Comment 4 Mark Michelson 2023-03-27 15:40:42 UTC
It appears the corresponding customer case has been closed. We also suspect that this core dump might be fixed by backporting commit 2e4f393650ccf298f26787583c13a88197ba8348 from OVN main (https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348). Once we backport this fix, we will close this issue.

Comment 5 David Hill 2023-05-04 18:31:13 UTC
We don't have the core dump, and it didn't happen again.

Comment 9 OVN Bot 2023-06-08 17:45:20 UTC
ovn-2021 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2213611

Comment 13 Jianlin Shi 2023-06-25 01:39:34 UTC
Hi Ales,

This is the changelog entry I found in ovn-2021.spec:
* Thu May 18 2023 Ales Musil <amusil> - 21.12.0-132                                        
- [branch-21.12] ovs: Bump submodule to v2.17.6 (#2163559)                                                                                                                                                  
[Upstream: 52ef956bb4e9de2d418805dd43b337184f1aa560]

Which specific patch fixes the issue? Is there a reproducer for it? Thanks.

Comment 14 Ales Musil 2023-07-03 05:53:04 UTC
Hi,

the commit https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348
includes a test that was used to reproduce the original issue; basing the reproducer on that test is the best approach.
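
For example, something along these lines should run just that test from an already-built OVN source tree (a sketch; the -k keyword is a placeholder for the name of the test the commit adds):

  # run only the matching autotest case from the OVN testsuite
  make check TESTSUITEFLAGS='-k "<test name added by the commit>"'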

Thanks,
Ales

Comment 15 Jianlin Shi 2023-07-03 07:11:33 UTC
Tested with the following script:

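# capture ovn-controller cores via systemd-coredump and start from a clean journal/coredump state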
enable_coredump()
{
        ulimit -c unlimited
        ulimit -s unlimited
        sysctl -w fs.suid_dumpable=2
        if ! sysctl kernel.core_pattern | grep systemd-coredump                                       
        then
                sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c " 
        fi
        rm -rf /var/lib/systemd/coredump/*                                                            
        rm -rf /run/log/journal/*
        rm -rf /var/log/journal/*
        systemctl restart systemd-journald
}

enable_coredump


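# single-node setup: OVS, OVN central and ovn-controller all on this host, SB reachable on tcp:127.0.0.1:6642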
systemctl start openvswitch                                                                           
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641                                                                    
ovn-sbctl set-connection ptcp:6642
# NOTE: the pasted command was truncated at "external_1"; typical single-node geneve encap settings are assumed below
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller                                                                      

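# add a local OVS interface and bind it to logical port sw0-port1 before that logical port exists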
ovs-vsctl add-port br-int p1 -- set interface p1 type=internal                                        
ovs-vsctl set interface p1 external-ids:iface-id=sw0-port1                                            
ovn-nbctl --wait=hv sync
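# pause ovn-controller so it does not process the following NB/SB changes until resumed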
ovn-appctl debug/pause                                                                                
sleep 2
ovn-appctl -t ovn-controller debug/status
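# while paused, create and immediately delete sw0-port1; the add and delete reach ovn-controller together on resume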
ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1
ovn-nbctl lsp-del sw0-port1
ovn-nbctl --wait=sb sync

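# resume ovn-controller and let it process the accumulated updates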
ovn-appctl debug/resume
ovn-nbctl --wait=hv sync

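# finally remove the switch and sync once more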
ovn-nbctl ls-del sw0
ovn-nbctl --wait=hv sync                                                                              
                                                                                                      
                                                                                                      
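# on ovn-2021-21.12.0-130 this leaves a SIG 11 core from ovn-controller; on 21.12.0-134 no coredump is found (outputs below)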
coredumpctl list

Reproduced on ovn-2021-21.12.0-130.el8:

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
ovn-2021-21.12.0-130.el8fdp.x86_64
ovn-2021-host-21.12.0-130.el8fdp.x86_64
ovn-2021-central-21.12.0-130.el8fdp.x86_64
openvswitch2.17-2.17.0-106.el8fdp.x86_64

+ ovn-nbctl --wait=hv sync
+ coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Mon 2023-07-03 03:07:01 EDT   34521   993   990  11 none      /usr/bin/ovn-controller

<=== coredump

Verified on ovn-2021-21.12.0-134.el8:

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
ovn-2021-21.12.0-134.el8fdp.x86_64
ovn-2021-host-21.12.0-134.el8fdp.x86_64
openvswitch2.17-2.17.0-106.el8fdp.x86_64
ovn-2021-central-21.12.0-134.el8fdp.x86_64

+ ovn-nbctl --wait=hv sync
+ coredumpctl list
No coredumps found.

<=== no coredump

Comment 17 errata-xmlrpc 2023-07-06 20:05:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3995

