Created attachment 1910858 [details]
ovn-controller logs

Description of problem:

Some ovn-controllers crash from time to time, logging the following error:

2022-09-09T16:23:07.660Z|77029|util|EMER|controller/ofctrl.c:1203: assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid()

This causes problems at scale: the customer's southbound database is very large, and each reconnection after a crash puts heavy load on the southbound database.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:

Expected results:

Additional info:
coredump file is attached
Putting the stack trace here, so we don't lose it:

#0  0x00007f0064b4070f in raise () from /lib64/libc.so.6
#1  0x00007f0064b2ab25 in abort () from /lib64/libc.so.6
#2  0x0000558157200954 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/util.c:444
#3  0x0000558157208964 in vlog_abort_valist (module_=<optimized out>, message=0x5581572eb850 "%s: assertion %s failed in %s()", args=args@entry=0x7fff877328e0) at lib/vlog.c:1249
#4  0x0000558157208a0a in vlog_abort (module=module@entry=0x5581575b1900 <this_module>, message=message@entry=0x5581572eb850 "%s: assertion %s failed in %s()") at lib/vlog.c:1263
#5  0x000055815720066b in ovs_assert_failure (where=where@entry=0x5581572c654e "controller/ofctrl.c:1203", function=function@entry=0x5581572c6c00 <__func__.35103> "flood_remove_flows_for_sb_uuid", condition=condition@entry=0x5581572c6a08 "ovs_list_is_empty(&f->list_node)") at lib/util.c:86
#6  0x000055815712b302 in flood_remove_flows_for_sb_uuid (flow_table=flow_table@entry=0x558157cbfda0, sb_uuid=sb_uuid@entry=0x55824630af60, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1210
#7  0x000055815712b722 in ofctrl_flood_remove_flows (flow_table=0x558157cbfda0, flood_remove_nodes=flood_remove_nodes@entry=0x7fff87732b60) at controller/ofctrl.c:1267
#8  0x0000558157124aa4 in lflow_handle_changed_ref (ref_type=ref_type@entry=REF_TYPE_PORTBINDING, ref_name=ref_name@entry=0x7fff87732c00 "310_937", l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70, changed=changed@entry=0x7fff87732bff) at controller/lflow.c:505
#9  0x00005581571270b9 in lflow_handle_flows_for_lport (pb=<optimized out>, l_ctx_in=l_ctx_in@entry=0x7fff87732cc0, l_ctx_out=l_ctx_out@entry=0x7fff87732c70) at controller/lflow.c:1931
#10 0x0000558157141b4e in lflow_output_runtime_data_handler (node=0x7fff87738210, data=<optimized out>) at controller/ovn-controller.c:2390
#11 0x000055815715e0fb in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:353
#12 engine_run_node (recompute_allowed=true, node=0x7fff87738210) at lib/inc-proc-eng.c:402
#13 engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:427
#14 0x00005581571156fc in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:3196
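A backtrace like the one above can be regenerated from the attached coredump with gdb. The paths below are illustrative assumptions (a locally saved copy of the core and the ovn-controller binary from the same ovn2.13-20.12.0-192.el8fdp build); the matching debuginfo packages are needed for frames to resolve to source lines:

  # Adjust both paths to where the coredump was saved and to the matching binary.
  gdb /usr/bin/ovn-controller /tmp/core.ovn-controller \
      -batch -ex 'set pagination off' -ex 'bt'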
Does this still happen when you have enabled "ovn-monitor-all" again? If not, we might have a potential fix. Still, the fix is a bit of a guess, as this seems to be almost impossible to reproduce.
Yes, the high load on the ovndb-server has been gone since ovn-monitor-all was set to true. RHOS is now much more responsive.
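For context, ovn-monitor-all is a per-chassis setting that ovn-controller reads from the external_ids column of the local Open_vSwitch table. A minimal sketch of checking and enabling it on a compute node (assuming the standard local ovsdb socket is reachable by ovs-vsctl):

  # Show the current value; this errors out if the key has never been set.
  ovs-vsctl get open_vswitch . external_ids:ovn-monitor-all

  # Enable monitoring of all southbound rows instead of conditional monitoring.
  ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true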
(In reply to Ales Musil from comment #10)
> Does this still happen when you have enabled "ovn-monitor-all" again?
> If not, we might have a potential fix. Still, the fix is a bit of a guess,
> as this seems to be almost impossible to reproduce.

Yes, it still happens. The crashes also occur on the nodes that used to have ovn-monitor-all=true set, and there were 3 crashes just today:

15:57 $ ansible -f100 -b -m shell -a "zgrep EMER /var/log/containers/openvswitch/ovn-controller*" compute-d | grep "2022-10-24" -c
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
3

and about 66 crashes in the last 10 days.