Bug 2125723

Summary: ovn-controller crashes with assertion ovs_list_is_empty
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jakub Libosvar <jlibosva>
Component: ovn2.13
Assignee: Ales Musil <amusil>
Status: ASSIGNED
QA Contact: Ehsan Elahi <eelahi>
Severity: urgent
Priority: medium
Version: FDP 22.C
CC: amusil, ctrautma, jiji, jishi, jmelvin, ljozsa, mmichels, ralongi
Flags: amusil: needinfo? (ljozsa)
Type: Bug
Bug Blocks: 2122808    
Attachments: ovn-controller logs

Description Jakub Libosvar 2022-09-09 20:52:53 UTC
Created attachment 1910858: ovn-controller logs

Description of problem:
Some ovn-controllers crash from time to time, logging the following error:
2022-09-09T16:23:07.660Z|77029|util|EMER|controller/ofctrl.c:1203: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()

This causes problems at scale: the customer's southbound database is very large, and every crash forces a reconnection that overloads the southbound database.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:


Expected results:


Additional info:
The coredump file is attached.

Comment 9 Ales Musil 2022-10-19 09:56:42 UTC
Putting the stack trace here so we don't lose it:

#0  0x00007f0064b4070f in raise () from /lib64/libc.so.6                                                                                                                                                                          
#1  0x00007f0064b2ab25 in abort () from /lib64/libc.so.6                                                                                                                                                                          
#2  0x0000558157200954 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/util.c:444                                        
#3  0x0000558157208964 in vlog_abort_valist (module_=<optimized out>, message=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/vlog.c:1249                                                
#4  0x0000558157208a0a in vlog_abort (module=module@entry=0x5581575b1900 <this_module>, message=message@entry=0x5581572eb850 "%s: assertion %s failed in %s()") at lib/vlog.c:1263                                                
#5  0x000055815720066b in ovs_assert_failure (where=where@entry=0x5581572c654e "controller/ofctrl.c:1203", function=function@entry=0x5581572c6c00 <__func__.35103> "flood_remove_flows_for_sb_uuid",                              
    condition=condition@entry=0x5581572c6a08 "ovs_list_is_empty(&f->list_node)") at lib/util.c:86                                                                                                                                 
#6  0x000055815712b302 in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x558157cbfda0, sb_uuid=sb_uuid@entry=0x55824630af60, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60)                           
    at controller/ofctrl.c:1210                                                                                                                                                                                                   
#7  0x000055815712b722 in ofctrl_flood_remove_flows (flow_table=0x558157cbfda0, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1267                                                           
#8  0x0000558157124aa4 in lflow_handle_changed_ref (ref_type=ref_type@entry=REF_TYPE_PORTBINDING, ref_name=ref_name@entry=0x7fff87732c00 "310_937", l_ctx_in=l_ctx_in@entry=0x7fff87732cc0,                                       
    l_ctx_out=l_ctx_out@entry=0x7fff87732c70, changed=changed@entry=0x7fff87732bff) at controller/lflow.c:505                                                                                                                     
#9  0x00005581571270b9 in lflow_handle_flows_for_lport (pb=<optimized out>, l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70) at controller/lflow.c:1931                                          
#10 0x0000558157141b4e in lflow_output_runtime_data_handler (node=0x7fff87738210, data=<optimized out>) at controller/ovn-controller.c:2390                                                                                       
#11 0x000055815715e0fb in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:353                                                                                                      
#12 engine_run_node (recompute_allowed=true, node=0x7fff87738210) at lib/inc-proc-eng.c:402                                                                                                                                       
#13 engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:427                                                                                                                                         
#14 0x00005581571156fc in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:3196
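
For context on frames #5/#6: ovs_list_is_empty() on an element's embedded list node returns true only when the node points at itself, i.e. the element is not currently linked into any list, so the assertion fires when flood_remove_flows_for_sb_uuid() finds a flow whose list_node is still linked somewhere (for example if the same flow were queued onto a flood-remove list twice; that is just one plausible scenario, not confirmed). The snippet below is a minimal self-contained sketch of that invariant; the names are made up to mirror the openvswitch list convention and it is NOT the actual ofctrl.c code.

/* Minimal sketch of the invariant behind the assertion (hypothetical names,
 * mirroring the openvswitch list convention; not the real ofctrl.c code). */
#include <assert.h>
#include <stdio.h>

struct list_node {
    struct list_node *prev, *next;
};

/* An initialized, unlinked node points at itself. */
static void list_init(struct list_node *n) { n->prev = n->next = n; }

/* True when the node is not linked into any list (self-linked). */
static int list_is_empty(const struct list_node *n) { return n->next == n; }

/* Link n at the tail of the list that head represents. */
static void list_push_back(struct list_node *head, struct list_node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

struct desired_flow {
    struct list_node list_node;   /* membership in a flood-remove list */
    int id;
};

int main(void)
{
    struct list_node removal_list;
    struct desired_flow f = { .id = 42 };

    list_init(&removal_list);
    list_init(&f.list_node);

    /* First queuing is fine: the node is still self-linked ("empty"). */
    assert(list_is_empty(&f.list_node));
    list_push_back(&removal_list, &f.list_node);

    /* Queuing the same flow a second time would violate the invariant:
     * the node is no longer self-linked, which is the condition the
     * EMER assertion in flood_remove_flows_for_sb_uuid() reports. */
    if (!list_is_empty(&f.list_node)) {
        printf("flow %d is already linked into a removal list; linking it "
               "again would corrupt both lists\n", f.id);
    }
    return 0;
}

Built as a standalone program this just prints the warning instead of aborting, but the condition it checks has the same shape as the failed assertion above.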

Comment 10 Ales Musil 2022-10-21 05:54:53 UTC
Does this still happen now that you have enabled "ovn-monitor-all" again?
If not, we might have a potential fix. Still, the fix is a bit of a guess, as this seems to be almost impossible to reproduce.

Comment 11 Jeremy 2022-10-24 14:38:31 UTC
Yes, the high load on ovndb-server is gone since ovn-monitor-all was set to true. RHOS is now much more responsive.

Comment 12 Jakub Libosvar 2022-10-24 15:58:43 UTC
(In reply to Ales Musil from comment #10)
> Does this still happen now that you have enabled "ovn-monitor-all" again?
> If not, we might have a potential fix. Still, the fix is a bit of a guess,
> as this seems to be almost impossible to reproduce.

Yes, it still happens. The crashes also occur on the nodes that used to have ovn-monitor-all=true set, and there were 3 crashes just today:

15:57 $ ansible -f100 -b -m shell -a "zgrep EMER /var/log/containers/openvswitch/ovn-controller*" compute-d | grep "2022-10-24" -c
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
3

and about 66 crashes in the last 10 days.