Description of problem:
ovn-controller dumped core on all nodes (controllers and computes), and many FIP flows were affected:

Dec 23 11:28:55 overcloud-controller-0 systemd-coredump[720680]: Process 558845 (ovn-controller) of user 0 dumped core.

Stack trace of thread 7:
#0 0x0000561dfcd3d271 n/a (/usr/bin/ovn-controller)
#1 0x0000000000000000 n/a (n/a)
#2 0x0000000000000000 n/a (n/a)

Version-Release number of selected component (if applicable):

How reproducible:
Twice so far.

Steps to Reproduce:
1. No clear clues yet.

Actual results:

Expected results:
No core dumps and no affected flows.

Additional info:
The OVS logs show the following prior to the crash:

2022-12-23T10:23:55.239Z|09912|connmgr|INFO|br-int<->unix#1: 402 flow_mods 10 s ago (402 adds)
2022-12-23T10:24:55.239Z|09913|connmgr|INFO|br-int<->unix#1: 960 flow_mods in the last 52 s (957 adds, 3 deletes)
2022-12-23T10:25:55.239Z|09914|connmgr|INFO|br-int<->unix#1: 388 flow_mods in the 18 s starting 50 s ago (385 adds, 3 deletes)
2022-12-23T10:27:41.955Z|09915|connmgr|INFO|br-int<->unix#1: 714 flow_mods in the 7 s starting 10 s ago (348 adds, 366 deletes)
2022-12-23T10:28:05.852Z|09916|connmgr|INFO|br-int<->unix#1: 1748 flow_mods in the 22 s starting 23 s ago (339 adds, 1409 deletes)
2022-12-23T10:29:26.671Z|00001|timeval(handler48)|WARN|Unreasonably long 1565ms poll interval (0ms user, 2ms system)
2022-12-23T10:29:26.672Z|00002|timeval(handler48)|WARN|faults: 1 minor, 0 major
2022-12-23T10:29:26.672Z|00003|timeval(handler48)|WARN|context switches: 0 voluntary, 1 involuntary
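When systemd-coredump messages pass through rsyslog, control characters are escaped as '#' plus three octal digits, so every newline in the stack trace appears as #012. The sketch below (function names are hypothetical, GNU sed/awk assumed) restores those line breaks and totals the adds and deletes reported by the ovs-vswitchd connmgr INFO lines, which makes the jump in deletes just before the crash easy to quantify.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading the syslog/OVS excerpts in this report.
# Assumes GNU sed (for "\n" in the replacement) and a POSIX awk.

# rsyslog escapes control characters as '#ooo' (octal); '#012' is a newline.
# Turn those escapes back into real line breaks.
decode_syslog_escapes() {
    sed 's/#012/\n/g'
}

# Sum the "(N adds, M deletes)" counters of connmgr flow_mods lines on stdin.
# Lines that only report adds, e.g. "(402 adds)", contribute 0 deletes.
sum_flow_mods() {
    grep -o '([0-9]* adds[^)]*)' |
    awk '{
        # Drop the parentheses and comma so awk re-splits "N adds M deletes".
        gsub(/[(),]/, " ")
        for (i = 1; i < NF; i++) {
            if ($(i + 1) == "adds") adds += $i
            if ($(i + 1) == "deletes") dels += $i
        }
    }
    END { printf "adds=%d deletes=%d\n", adds, dels }'
}
```

Usage: `grep connmgr ovs-vswitchd.log | sum_flow_mods`, or pipe a forwarded coredump message through `decode_syslog_escapes` to get the stack trace back on separate lines.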
It appears the corresponding customer case has been closed. We also suspect that this core dump may be fixed by backporting commit 2e4f393650ccf298f26787583c13a88197ba8348 from OVN main (https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348). We will close this issue once the fix is backported.
We do not have the core dump, and the crash has not happened again.
ovn-2021 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2213611
Hi Ales, this is the changelog entry I found in ovn-2021.spec:

* Thu May 18 2023 Ales Musil <amusil> - 21.12.0-132
- [branch-21.12] ovs: Bump submodule to v2.17.6 (#2163559)
  [Upstream: 52ef956bb4e9de2d418805dd43b337184f1aa560]

Which specific patch fixes the issue? Is there a reproducer for it? Thanks
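To find which packaged change references a given bug without opening the spec file, the installed package's changelog (`rpm -q --changelog ovn-2021`) can be filtered by bug number. The sketch below is a hypothetical helper; it assumes changelog entries are separated by blank lines, as `rpm -q --changelog` prints them, and that the bug is cited as "(#NNNNNNN)" as in the entry above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every changelog entry that cites a Bugzilla number.
# $1: bug number, e.g. 2163559. Reads changelog text on stdin.
changelog_entries_for_bug() {
    # RS="" puts awk in paragraph mode: each blank-line-separated entry is one
    # record, so a matching entry is printed whole (header, items, Upstream tag).
    awk -v bug="(#$1)" 'BEGIN { RS = ""; ORS = "\n\n" } index($0, bug) > 0'
}
```

Usage: `rpm -q --changelog ovn-2021 | changelog_entries_for_bug 2163559`. Note that this only shows where the bug is referenced; as this thread shows, the entry may describe a submodule bump rather than the specific upstream fix.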
Hi, the commit https://github.com/ovn-org/ovn/commit/2e4f393650ccf298f26787583c13a88197ba8348 includes a test that was used to reproduce the original issue. Basing a reproducer on that test is the best chance of hitting the crash.

Thanks,
Ales
Tested with the following script:

enable_coredump()
{
    ulimit -c unlimited
    ulimit -s unlimited
    sysctl -w fs.suid_dumpable=2
    if ! sysctl kernel.core_pattern | grep systemd-coredump
    then
        sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c "
    fi
    rm -rf /var/lib/systemd/coredump/*
    rm -rf /run/log/journal/*
    rm -rf /var/log/journal/*
    systemctl restart systemd-journald
}

enable_coredump
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_1
systemctl restart ovn-controller
ovs-vsctl add-port br-int p1 -- set interface p1 type=internal
ovs-vsctl set interface p1 external-ids:iface-id=sw0-port1
ovn-nbctl --wait=hv sync
ovn-appctl debug/pause
sleep 2
ovn-appctl -t ovn-controller debug/status
ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1
ovn-nbctl lsp-del sw0-port1
ovn-nbctl --wait=sb sync
ovn-appctl debug/resume
ovn-nbctl --wait=hv sync
ovn-nbctl ls-del sw0
ovn-nbctl --wait=hv sync
coredumpctl list

Reproduced on ovn-2021-21.12.0-130.el8:

+ ovn-nbctl --wait=hv sync
+ coredumpctl list
TIME PID UID GID SIG COREFILE EXE
Mon 2023-07-03 03:07:01 EDT 34521 993 990 11 none /usr/bin/ovn-controller   <=== coredump

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
ovn-2021-21.12.0-130.el8fdp.x86_64
ovn-2021-host-21.12.0-130.el8fdp.x86_64
ovn-2021-central-21.12.0-130.el8fdp.x86_64
openvswitch2.17-2.17.0-106.el8fdp.x86_64

Verified on ovn-2021-21.12.0-134.el8:

+ ovn-nbctl --wait=hv sync
+ coredumpctl list
No coredumps found.   <=== no coredump

[root@sweetpig-8 bz2163559]# rpm -qa | grep -E "openvswitch2.17|ovn-2021"
ovn-2021-21.12.0-134.el8fdp.x86_64
ovn-2021-host-21.12.0-134.el8fdp.x86_64
openvswitch2.17-2.17.0-106.el8fdp.x86_64
ovn-2021-central-21.12.0-134.el8fdp.x86_64
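The reproducer ends with a manual `coredumpctl list`, whose output is then read by eye. A possible way to turn that into a pass/fail result is sketched below; the function names are hypothetical, and the check simply looks for any field equal to /usr/bin/ovn-controller, so it does not depend on which columns (e.g. SIZE) a given systemd version prints.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: decide pass/fail from `coredumpctl list` output on stdin.

# Count listed core dumps whose executable is /usr/bin/ovn-controller.
count_controller_cores() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "/usr/bin/ovn-controller") { n++; break }
    }
    END { print n + 0 }'
}

# Succeed (exit 0) only when no ovn-controller core dump was recorded.
check_no_controller_core() {
    [ "$(count_controller_cores)" -eq 0 ]
}
```

Usage, as the last step of the script: `coredumpctl list --no-pager 2>/dev/null | check_no_controller_core && echo PASS || echo FAIL`. With the 21.12.0-130 output above this reports FAIL, and with "No coredumps found." it reports PASS.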
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3995