Bug 2125723

| Summary: | ovn-controller crashes with assertion ovs_list_is_empty | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Jakub Libosvar <jlibosva> |
| Component: | ovn2.13 | Assignee: | Ales Musil <amusil> |
| Status: | ASSIGNED | QA Contact: | Ehsan Elahi <eelahi> |
| Severity: | urgent | Priority: | medium |
| Version: | FDP 22.C | CC: | amusil, ctrautma, jiji, jishi, jmelvin, ljozsa, mmichels, ralongi |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Flags: | amusil: needinfo? (ljozsa) | Type: | Bug |
| Bug Blocks: | 2122808 | Attachments: | ovn-controller logs (attachment 1910858) |
Putting the stack trace here, so we don't lose it:

```
#0 0x00007f0064b4070f in raise () from /lib64/libc.so.6
#1 0x00007f0064b2ab25 in abort () from /lib64/libc.so.6
#2 0x0000558157200954 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/util.c:444
#3 0x0000558157208964 in vlog_abort_valist (module_=<optimized out>, message=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/vlog.c:1249
#4 0x0000558157208a0a in vlog_abort (module=module@entry=0x5581575b1900 <this_module>, message=message@entry=0x5581572eb850 "%s: assertion %s failed in %s()") at lib/vlog.c:1263
#5 0x000055815720066b in ovs_assert_failure (where=where@entry=0x5581572c654e "controller/ofctrl.c:1203", function=function@entry=0x5581572c6c00 <__func__.35103> "flood_remove_flows_for_sb_uuid",
condition=condition@entry=0x5581572c6a08 "ovs_list_is_empty(&f->list_node)") at lib/util.c:86
#6 0x000055815712b302 in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x558157cbfda0, sb_uuid=sb_uuid@entry=0x55824630af60, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60)
at controller/ofctrl.c:1210
#7 0x000055815712b722 in ofctrl_flood_remove_flows (flow_table=0x558157cbfda0, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1267
#8 0x0000558157124aa4 in lflow_handle_changed_ref (ref_type=ref_type@entry=REF_TYPE_PORTBINDING, ref_name=ref_name@entry=0x7fff87732c00 "310_937", l_ctx_in=l_ctx_in@entry=0x7fff87732cc0,
l_ctx_out=l_ctx_out@entry=0x7fff87732c70, changed=changed@entry=0x7fff87732bff) at controller/lflow.c:505
#9 0x00005581571270b9 in lflow_handle_flows_for_lport (pb=<optimized out>, l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70) at controller/lflow.c:1931
#10 0x0000558157141b4e in lflow_output_runtime_data_handler (node=0x7fff87738210, data=<optimized out>) at controller/ovn-controller.c:2390
#11 0x000055815715e0fb in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:353
#12 engine_run_node (recompute_allowed=true, node=0x7fff87738210) at lib/inc-proc-eng.c:402
#13 engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:427
#14 0x00005581571156fc in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:3196
```
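
For context on the failing check, here is a minimal, self-contained sketch of the OVS embedded-list pattern the assertion relies on. This is not the actual ofctrl.c code: `struct flow_entry` and the `main()` driver are hypothetical, and the two list helpers are simplified versions of the real ones in OVS's `lib/openvswitch/list.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for OVS's struct ovs_list (lib/openvswitch/list.h):
 * a node in a circular doubly linked list. */
struct ovs_list {
    struct ovs_list *prev;
    struct ovs_list *next;
};

static void
ovs_list_init(struct ovs_list *list)
{
    list->prev = list->next = list;   /* a detached node points to itself */
}

static bool
ovs_list_is_empty(const struct ovs_list *list)
{
    return list->next == list;        /* true only when nothing is linked */
}

/* Hypothetical flow entry; the real structure lives in controller/ofctrl.c
 * and embeds a list_node used to queue flows for removal. */
struct flow_entry {
    struct ovs_list list_node;
};

int
main(void)
{
    struct flow_entry f;
    ovs_list_init(&f.list_node);

    /* The assertion in flood_remove_flows_for_sb_uuid() expects list_node to
     * still be self-linked (i.e. the flow is not already queued elsewhere);
     * when that is not the case, ovs_assert() aborts the process, which is
     * what the stack trace above shows. */
    assert(ovs_list_is_empty(&f.list_node));
    printf("flow is not queued for removal yet\n");
    return 0;
}
```

In other words, the abort at controller/ofctrl.c:1203 means the flow's `list_node` was still linked into some list at a point where the code expects it to be detached.
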
Does this still happen when you have enabled "ovn-monitor-all" again? If not, we might have a potential fix. Still, the fix is a bit of a guess, as this seems to be almost impossible to reproduce.

Yes, the high load of ovndb-server is gone since ovn-monitor-all was set to true. RHOS is now much more responsive.

(In reply to Ales Musil from comment #10)
> Does this still happen when you have enabled "ovn-monitor-all" again?
> If not, we might have a potential fix. Still, the fix is a bit of a guess,
> as this seems to be almost impossible to reproduce.

Yes, it still happens. The crashes also occur on the nodes that used to have ovn-monitor-all=true set, and there were 3 crashes just today:

```
15:57 $ ansible -f100 -b -m shell -a "zgrep EMER /var/log/containers/openvswitch/ovn-controller*" compute-d | grep "2022-10-24" -c
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
3
```

There have been about 66 crashes in the last 10 days.
Created attachment 1910858 [details]
ovn-controller logs

Description of problem:
Some ovn-controllers crash from time to time, issuing the following error in the logs:

```
2022-09-09T16:23:07.660Z|77029|util|EMER|controller/ofctrl.c:1203: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()
```

This causes problems at scale: the customer's southbound database is very large, and each reconnection after a crash chokes the southbound database.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Unknown

Actual results:

Expected results:

Additional info:
A coredump file is attached.