Created attachment 1910858 [details]
ovn-controller logs

Description of problem:

Some ovn-controllers crash from time to time, logging the following error:

2022-09-09T16:23:07.660Z|77029|util|EMER|controller/ofctrl.c:1203: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()

This causes problems at scale: the customer's southbound database is very large, and each reconnection after a crash puts heavy load on the southbound database.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:

Expected results:

Additional info:
coredump file is attached
Putting the stack trace here, so we don't lose it:

#0  0x00007f0064b4070f in raise () from /lib64/libc.so.6
#1  0x00007f0064b2ab25 in abort () from /lib64/libc.so.6
#2  0x0000558157200954 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/util.c:444
#3  0x0000558157208964 in vlog_abort_valist (module_=<optimized out>, message=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/vlog.c:1249
#4  0x0000558157208a0a in vlog_abort (module=module@entry=0x5581575b1900 <this_module>, message=message@entry=0x5581572eb850 "%s: assertion %s failed in %s()") at lib/vlog.c:1263
#5  0x000055815720066b in ovs_assert_failure (where=where@entry=0x5581572c654e "controller/ofctrl.c:1203", function=function@entry=0x5581572c6c00 <__func__.35103> "flood_remove_flows_for_sb_uuid", condition=condition@entry=0x5581572c6a08 "ovs_list_is_empty(&f->list_node)") at lib/util.c:86
#6  0x000055815712b302 in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x558157cbfda0, sb_uuid=sb_uuid@entry=0x55824630af60, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1210
#7  0x000055815712b722 in ofctrl_flood_remove_flows (flow_table=0x558157cbfda0, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1267
#8  0x0000558157124aa4 in lflow_handle_changed_ref (ref_type=ref_type@entry=REF_TYPE_PORTBINDING, ref_name=ref_name@entry=0x7fff87732c00 "310_937", l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70, changed=changed@entry=0x7fff87732bff) at controller/lflow.c:505
#9  0x00005581571270b9 in lflow_handle_flows_for_lport (pb=<optimized out>, l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70) at controller/lflow.c:1931
#10 0x0000558157141b4e in lflow_output_runtime_data_handler (node=0x7fff87738210, data=<optimized out>) at controller/ovn-controller.c:2390
#11 0x000055815715e0fb in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:353
#12 engine_run_node (recompute_allowed=true, node=0x7fff87738210) at lib/inc-proc-eng.c:402
#13 engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:427
#14 0x00005581571156fc in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:3196
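A backtrace like the one above can be regenerated from the attached coredump with gdb. The paths below are illustrative assumptions (a locally saved copy of the core and the ovn-controller binary from the same ovn2.13-20.12.0-192.el8fdp build); the matching debuginfo packages are needed for frames to resolve to source lines:

  # Adjust both paths to where the coredump was saved and to the matching binary.
  gdb /usr/bin/ovn-controller /tmp/core.ovn-controller \
      -batch -ex 'set pagination off' -ex 'bt'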
Does this still happen when you have enabled "ovn-monitor-all" again? If not, we might have a potential fix. Still, the fix is a bit of a guess, as this seems to be almost impossible to reproduce.
Yes, the high load on the ovndb-server has been gone since ovn-monitor-all was set to true. RHOS is now much more responsive.
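For context, ovn-monitor-all is a per-chassis setting that ovn-controller reads from the external_ids column of the local Open_vSwitch table. A minimal sketch of checking and enabling it on a compute node (assuming the standard local ovsdb socket is reachable by ovs-vsctl):

  # Show the current value; this errors out if the key has never been set.
  ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all

  # Enable monitoring of all southbound rows instead of conditional monitoring.
  ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true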
(In reply to Ales Musil from comment #10)
> Does this still happen when you have enabled "ovn-monitor-all" again?
> If not, we might have a potential fix. Still, the fix is a bit of a guess,
> as this seems to be almost impossible to reproduce.

Yes, it still happens. The crashes also occur on the nodes that used to have ovn-monitor-all=true set, and there were 3 crashes just today:

15:57 $ ansible -f100 -b -m shell -a "zgrep EMER /var/log/containers/openvswitch/ovn-controller*" compute-d | grep "2022-10-24" -c
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
3

and about 66 crashes in the last 10 days.