Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 2125723

Summary: ovn-controller crashes with assertion ovs_list_is_empty
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jakub Libosvar <jlibosva>
Component: ovn2.13
Assignee: Ales Musil <amusil>
Status: CLOSED WONTFIX
QA Contact: Ehsan Elahi <eelahi>
Severity: urgent
Docs Contact:
Priority: medium
Version: FDP 22.C
CC: amusil, ctrautma, jiji, jishi, jmelvin, ljozsa, mmichels, ralongi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-10-29 20:30:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2122808
Attachments:
  ovn-controller logs (flags: none)

Description Jakub Libosvar 2022-09-09 20:52:53 UTC
Created attachment 1910858 [details]
ovn-controller logs

Description of problem:
Some ovn-controllers crash from time to time, logging the following error:
2022-09-09T16:23:07.660Z|77029|util|EMER|controller/ofctrl.c:1203: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()

This causes problems at scale: the customer's southbound database is very large, so each reconnection after a crash puts heavy load on the southbound database.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:


Expected results:


Additional info:
coredump file is attached

Comment 9 Ales Musil 2022-10-19 09:56:42 UTC
Putting the stack trace here so we don't lose it:

#0  0x00007f0064b4070f in raise () from /lib64/libc.so.6                                                                                                                                                                          
#1  0x00007f0064b2ab25 in abort () from /lib64/libc.so.6                                                                                                                                                                          
#2  0x0000558157200954 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/util.c:444                                        
#3  0x0000558157208964 in vlog_abort_valist (module_=<optimized out>, message=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/vlog.c:1249                                                
#4  0x0000558157208a0a in vlog_abort (module=module@entry=0x5581575b1900 <this_module>, message=message@entry=0x5581572eb850 "%s: assertion %s failed in %s()") at lib/vlog.c:1263                                                
#5  0x000055815720066b in ovs_assert_failure (where=where@entry=0x5581572c654e "controller/ofctrl.c:1203", function=function@entry=0x5581572c6c00 <__func__.35103> "flood_remove_flows_for_sb_uuid",                              
    condition=condition@entry=0x5581572c6a08 "ovs_list_is_empty(&f->list_node)") at lib/util.c:86                                                                                                                                 
#6  0x000055815712b302 in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x558157cbfda0, sb_uuid=sb_uuid@entry=0x55824630af60, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60)                           
    at controller/ofctrl.c:1210                                                                                                                                                                                                   
#7  0x000055815712b722 in ofctrl_flood_remove_flows (flow_table=0x558157cbfda0, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1267                                                           
#8  0x0000558157124aa4 in lflow_handle_changed_ref (ref_type=ref_type@entry=REF_TYPE_PORTBINDING, ref_name=ref_name@entry=0x7fff87732c00 "310_937", l_ctx_in=l_ctx_in@entry=0x7fff87732cc0,                                       
    l_ctx_out=l_ctx_out@entry=0x7fff87732c70, changed=changed@entry=0x7fff87732bff) at controller/lflow.c:505                                                                                                                     
#9  0x00005581571270b9 in lflow_handle_flows_for_lport (pb=<optimized out>, l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70) at controller/lflow.c:1931                                          
#10 0x0000558157141b4e in lflow_output_runtime_data_handler (node=0x7fff87738210, data=<optimized out>) at controller/ovn-controller.c:2390                                                                                       
#11 0x000055815715e0fb in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:353                                                                                                      
#12 engine_run_node (recompute_allowed=true, node=0x7fff87738210) at lib/inc-proc-eng.c:402                                                                                                                                       
#13 engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:427                                                                                                                                         
#14 0x00005581571156fc in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:3196

Comment 10 Ales Musil 2022-10-21 05:54:53 UTC
Does this still happen when you have "ovn-monitor-all" enabled again?
If not, we might have a potential fix. Still, the fix is a bit of a guess, as this seems to be almost impossible to reproduce.

Comment 11 Jeremy 2022-10-24 14:38:31 UTC
Yes, the high load of ovndb-server is gone since ovn-monitor-all was set to true. RHOS is now much more responsive.
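For reference, the "ovn-monitor-all" knob discussed above is a per-chassis setting stored in the local Open vSwitch database (see ovn-controller(8)); a typical way to enable it on a node, assuming the default database socket, is:

```shell
# Disable conditional monitoring: ovn-controller will request the full
# southbound contents instead of per-resource conditional monitoring,
# trading local memory for fewer monitor-condition updates to the SB DB.
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

# Verify the setting took effect:
ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all
```

This is a sketch of the standard configuration step, not a fix for the crash itself; ovn-controller picks the change up at runtime.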

Comment 12 Jakub Libosvar 2022-10-24 15:58:43 UTC
(In reply to Ales Musil from comment #10)
> Does this still happen when you have "ovn-monitor-all" enabled again?
> If not, we might have a potential fix. Still, the fix is a bit of a guess,
> as this seems to be almost impossible to reproduce.

Yes, it still happens. The crashes also occur on the nodes that used to have ovn-monitor-all=true set, and there were 3 crashes just today:

15:57 $ ansible -f100 -b -m shell -a "zgrep EMER /var/log/containers/openvswitch/ovn-controller*" compute-d | grep "2022-10-24" -c
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
3

and about 66 crashes in the last 10 days.

Comment 18 Mark Michelson 2024-10-29 20:30:11 UTC
This issue is being closed because it is one of three open OVN Bugzilla issues. If this issue is still a problem in modern OVN versions, please create a Jira issue.

Comment 19 Red Hat Bugzilla 2025-02-27 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days