Bug 2132964

Summary: Potential crash in ovn-controller when handling deleted port bindings
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: OvS team <ovs-bugzilla>
Component: openvswitch2.17
Assignee: xsimonar
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: FDP 22.H
CC: ctrautma, eelahi, jhsiao, ralongi, tredaelli, xsimonar
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch2.17-2.17.0-46.el9fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-21 18:19:11 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description OvS team 2022-10-07 11:26:36 UTC
+++ This bug was initially created as a clone of Bug #2126450 +++

Adding and then deleting a port can crash ovn-controller when both notifications from the SB database are received by ovn-controller in the same batch.

Backtrace:

#0  0x000000000040d947 in handle_deleted_lport (pb=0xc87480, b_ctx_in=0x7ffe44f6a4e0, b_ctx_out=0x7ffe44f6a470) at controller/binding.c:2514
#1  0x000000000040e204 in handle_deleted_vif_lport (pb=0xc87480, lport_type=lport_type@entry=LP_VIF, b_ctx_in=b_ctx_in@entry=0x7ffe44f6a4e0, b_ctx_out=b_ctx_out@entry=0x7ffe44f6a470) at controller/binding.c:2587
#2  0x0000000000412286 in binding_handle_port_binding_changes (b_ctx_in=b_ctx_in@entry=0x7ffe44f6a4e0, b_ctx_out=b_ctx_out@entry=0x7ffe44f6a470) at controller/binding.c:2921
#3  0x000000000043daf9 in runtime_data_sb_port_binding_handler (node=0x7ffe44f6d9a0, data=0xbbb380) at controller/ovn-controller.c:1617
#4  0x000000000045df8e in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:414
#5  engine_run_node (recompute_allowed=true, node=0x7ffe44f6d9a0) at lib/inc-proc-eng.c:476
#6  engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:501
#7  0x000000000040a835 in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:4127

(gdb) p pb->datapath
$3 = (struct sbrec_datapath_binding *) 0x0

The issue can be reproduced using the following ovn test case:
OVN_FOR_EACH_NORTHD([
AT_SETUP([ovn-controller port addition and deletion])
ovn_start
net_add n1

sim_add hv1
as hv1
ovs-vsctl add-br br-phys
ovn_attach n1 br-phys 192.168.0.1
ovn-appctl vlog/set dbg

ovs-vsctl set interface p1 external-ids:iface-id=sw0-port1
check ovn-nbctl --wait=hv sync
ovn-appctl debug/pause
OVS_WAIT_UNTIL([test x$(as hv1 ovn-appctl -t ovn-controller debug/status) = "xpaused"])

ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
ovn-nbctl lsp-del sw0-port1

ovn-appctl debug/resume
check ovn-nbctl --wait=hv sync

ovn-nbctl ls-del sw0
check ovn-nbctl --wait=hv sync

OVN_CLEANUP([hv1])
AT_CLEANUP
])

Comment 1 OvS team 2022-10-07 11:26:39 UTC
* Fri Oct 07 2022 Open vSwitch CI <ovs-ci> - 2.17.0-46
- Merging upstream branch-2.17 [RH git: b2b4334db0]
    Commit list:
    09e22fec45 daemon-unix: Fix file descriptor leak when monitor restarts child.
    53df50db26 vconn: Allow ECONNREFUSED in refuse connection test.
    26a11ca610 dpdk: Use DPDK 21.11.2 release.
    edf699ec64 m4: Test avx512 for x86 only.
    1989caf9ea ovsdb-idl: Preserve references for rows deleted in same IDL run as their insertion. (#2126450)
    db6a612cd7 python: idl: Fix idl.Row.__str__ method.
    73d7bf64a7 bond: Avoid deadlock while updating post recirculation rules.
    70a63391cb ofproto-dpif-upcall: Add debug commands to pause/resume revalidators.
    cf0e12f8ae test-list: Fix false-positive build failure with GCC 12.

Comment 4 Jianlin Shi 2022-10-17 01:37:03 UTC
Tried with the following script:
enable_coredump()                                                                                     
{
        ulimit -c unlimited
        ulimit -s unlimited
        sysctl -w fs.suid_dumpable=2                                                                  
        if ! sysctl kernel.core_pattern | grep systemd-coredump                                       
        then
                sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"
        fi
        rm -rf /var/lib/systemd/coredump/*
        rm -rf /run/log/journal/*                                                                     
        rm -rf /var/log/journal/*
        systemctl restart systemd-journald
}

check_coredump()
{
        coredumpctl list
}

enable_coredump
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.50.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.50.25
systemctl restart ovn-controller
ps aux | grep ovn-controller

ovn-appctl vlog/set dbg

ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external-ids:iface-id=sw0-port1
ovn-nbctl --wait=hv sync
ovn-appctl debug/pause
ovn-appctl -t ovn-controller debug/status

ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
ovn-nbctl lsp-del sw0-port1

ovn-appctl debug/resume
ovn-nbctl --wait=hv sync
ps aux | grep ovn-controller

ovn-nbctl ls-del sw0
ovn-nbctl --wait=hv sync
ps aux | grep ovn-controller
check_coredump

ovn-controller still crashes on openvswitch2.17-50.el9:

[root@dell-per740-33 bz2132964]# rpm -qa | grep -E "openvswitch2.17|ovn22.06"
ovn22.06-22.06.0-64.el9fdp.x86_64
ovn22.06-central-22.06.0-64.el9fdp.x86_64
ovn22.06-host-22.06.0-64.el9fdp.x86_64
openvswitch2.17-2.17.0-50.el9fdp.x86_64

+ systemctl restart ovn-controller                                                                    
+ ps aux                                                                                              
+ grep ovn-controller                       
openvsw+   40817  0.0  0.0 238236  6640 ?        S<sl 21:32   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach
root       40822  0.0  0.0   6412  2264 pts/0    S+   21:32   0:00 grep ovn-controller
+ ovn-appctl vlog/set dbg  
+ ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external-ids:iface-id=sw0-port1                                                                                                            
+ ovn-nbctl --wait=hv sync                                                                                                                                                                                  
+ ovn-appctl debug/pause                                                                                                                                                                                    
+ ovn-appctl -t ovn-controller debug/status                                                                                                                                                                 
paused    
+ ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 '50:54:00:00:00:01 192.168.0.2'
+ ovn-nbctl lsp-del sw0-port1                                                                                                                                                                               
+ ovn-appctl debug/resume        
+ ovn-nbctl --wait=hv sync                
+ ps aux            
+ grep ovn-controller                              
openvsw+   40951  0.0  0.0 238264  7100 ?        S<sl 21:32   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach
root       40956  0.0  0.0   6412  2264 pts/0    S+   21:32   0:00 grep ovn-controller                
+ ovn-nbctl ls-del sw0           
+ ovn-nbctl --wait=hv sync                         
+ ps aux                              
+ grep ovn-controller                  
openvsw+   40951  0.0  0.0 238264  7100 ?        S<sl 21:32   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach
root       40960  0.0  0.0   6412  2176 pts/0    S+   21:32   0:00 grep ovn-controller
+ check_coredump
+ coredumpctl list
TIME                          PID UID GID SIG     COREFILE EXE                     SIZE
Sun 2022-10-16 21:32:23 EDT 40817 986 986 SIGSEGV none     /usr/bin/ovn-controller  n/a

I don't know why no coredump file was generated.

Xavier, could you please help check? Thanks.
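
In the output above, the crash shows up twice: the ovn-controller PID changes from 40817 to 40951 after debug/resume, and coredumpctl lists a SIGSEGV for PID 40817. A minimal sketch of a helper that pulls the crashed PIDs out of "coredumpctl list" output follows. The function name segv_pids is hypothetical, and the column positions assume the layout printed above (weekday, date, time, and timezone before the PID); other systemd versions may print a different layout.

```shell
# Print the PID of every ovn-controller SIGSEGV row found on stdin.
# Assumes the "coredumpctl list" layout shown in this comment:
#   $1-$4 = weekday/date/time/timezone, $5 = PID, $8 = SIG, $10 = EXE
segv_pids() {
    awk '$8 == "SIGSEGV" && $10 ~ /ovn-controller$/ { print $5 }'
}

# Example (needs a host with systemd-coredump):
#   coredumpctl list | segv_pids
```

An empty result after the reproducer has run suggests that no ovn-controller SIGSEGV was recorded.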

Comment 5 xsimonar 2022-10-17 14:28:57 UTC
Hi Jianlin

The reproducer I initially posted does not always reproduce the issue.
Can you try adding "ovn-nbctl --wait=sb sync" right before "ovn-appctl debug/resume" ?

Thanks
Xavier
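
For reference, the paused section of the reproducer with the suggested extra sync could look like the sketch below. The commands are the ones already used in this bug; the function name batched_add_delete and the RUN variable are hypothetical additions so the sequence can be printed without touching a live system. The intent, as far as I understand "--wait=sb", is that both the port addition and the deletion have reached the southbound database before ovn-controller resumes.

```shell
# Paused add/delete sequence with the extra southbound sync.
# RUN is a hypothetical hook: leave it unset to execute the commands,
# or set RUN=echo to only print them.
batched_add_delete() {
    ${RUN:-} ovn-appctl debug/pause
    ${RUN:-} ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 "50:54:00:00:00:01 192.168.0.2"
    ${RUN:-} ovn-nbctl lsp-del sw0-port1
    # Extra step from this comment: wait for the SB database to catch up.
    ${RUN:-} ovn-nbctl --wait=sb sync
    ${RUN:-} ovn-appctl debug/resume
    ${RUN:-} ovn-nbctl --wait=hv sync
}

# Example: RUN=echo batched_add_delete
```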

Comment 6 Jianlin Shi 2022-10-18 00:46:52 UTC
(In reply to xsimonar from comment #5)
> Hi Jianlin
> 
> The reproducer I initially posted does not always reproduce the issue.
> Can you try adding "ovn-nbctl --wait=sb sync" right before "ovn-appctl
> debug/resume" ?
> 
> Thanks
> Xavier

the same result:

+ systemctl restart ovn-controller                                                                    
+ ps aux                                                                                              
+ grep ovn-controller                                                                                 
openvsw+   88564  0.0  0.0 238232  6748 ?        S<sl 20:45   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach                       
root       88569  0.0  0.0   6412  2264 pts/0    S+   20:45   0:00 grep ovn-controller                
+ ovn-appctl vlog/set dbg                                                                             
+ ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external-ids:iface-id=sw0-port1      
+ ovn-nbctl --wait=hv sync                                                                            
+ ovn-appctl debug/pause                                                                              
+ ovn-appctl -t ovn-controller debug/status                                                           
paused                                                                                                
+ ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 '50:54:00:00:00:01 192.168.0.2'
+ ovn-nbctl lsp-del sw0-port1                                                                         
+ ovn-nbctl --wait=sb sync                                                                            
+ ovn-appctl debug/resume                                                                             
+ ovn-nbctl --wait=hv sync                                                                            
+ ps aux                                                                                              
+ grep ovn-controller                                                                                 
openvsw+   88698  0.0  0.0 238264  7008 ?        S<sl 20:45   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach                       
root       88703  0.0  0.0   6412  2316 pts/0    S+   20:45   0:00 grep ovn-controller                
+ ovn-nbctl ls-del sw0                                                                                
+ ovn-nbctl --wait=hv sync                                                                            
+ ps aux                                                                                              
+ grep ovn-controller                                                                                 
openvsw+   88698  0.0  0.0 238264  7008 ?        S<sl 20:45   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach                       
root       88707  0.0  0.0   6412  2300 pts/0    S+   20:45   0:00 grep ovn-controller                
+ check_coredump                                                                                      
+ coredumpctl list                                                                                    
TIME                          PID UID GID SIG     COREFILE EXE                     SIZE               
Mon 2022-10-17 20:45:27 EDT 88564 986 986 SIGSEGV none     /usr/bin/ovn-controller  n/a 

[root@dell-per740-33 bz2132964]# rpm -qa | grep -E "openvswitch2.17|ovn22.06"
ovn22.06-22.06.0-64.el9fdp.x86_64
ovn22.06-central-22.06.0-64.el9fdp.x86_64
ovn22.06-host-22.06.0-64.el9fdp.x86_64
openvswitch2.17-2.17.0-50.el9fdp.x86_64

Comment 7 xsimonar 2022-10-18 14:45:38 UTC
Hi

Sorry, I was confused by the message "there is no coredump file generated".

The issue is fixed in the OVS code branch, but the fix only takes effect once OVN uses the proper OVS submodule.
The submodule change has not yet been backported to ovn-22.06.

As a side note, to generate the coredump, I had to check and change the following:
Check "cat /proc/<ovn-controller-pid>/limits"; I suspect the "Max core file size" is 0.
Set DefaultLimitCORE=infinity in /etc/systemd/system.conf.

Thanks
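
The limits check described above can be scripted roughly as follows. This is a sketch only: the function name core_soft_limit is hypothetical, it assumes the usual /proc/<pid>/limits layout (limit name, soft limit, hard limit, units), and the pidfile path in the example is the one visible in the ps output earlier in this bug.

```shell
# Print the soft "Max core file size" limit from a /proc/<pid>/limits file.
# Assumes the row reads: "Max core file size  <soft>  <hard>  <units>",
# so the soft limit is the fifth whitespace-separated field.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Example (pidfile path taken from the ps output in this bug):
#   core_soft_limit "/proc/$(cat /run/ovn/ovn-controller.pid)/limits"
```

A printed value of 0 would match the suspicion that the process is not allowed to write a core file.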

Comment 8 Jianlin Shi 2022-10-18 23:29:02 UTC
(In reply to xsimonar from comment #7)
> Hi
> 
> Sorry, I was confused by the message "there is no coredump file generated"
> 
> The issue is fixed in OVS code branch, but it requires OVN to use the proper
> OVS submodule. 
> The submodule change is not backported yet to ovn-22.06.
> 
> As a side note, to generate the coredump, I had to do/check the following:
> "cat /proc/<ovn-controller-pid>/limits" I suspect the "Max core file size"
> is 0.
> Set DefaultLimitCORE=infinity in /etc/systemd/system.conf.
> 
> Thanks

Then how can I verify the bug? Is the submodule change backported to ovn-22.09?

Comment 9 xsimonar 2022-10-20 07:22:58 UTC
No, there is no downstream yet.
There is another (unrelated) issue that has so far prevented me from making the submodule change.
I think we should move the state back to ASSIGNED.

Comment 10 Jianlin Shi 2022-10-20 07:30:46 UTC
(In reply to xsimonar from comment #9)
> No, there is no downstream yet. 
> There is an other (unrelated) issue preventing me so far to do the submodule
> change
> I think we should move the state back to ASSIGNED.

If it can't be fixed in 22.J, then we need to ask Mark to help remove it from the errata.

Comment 12 Jianlin Shi 2022-11-02 06:19:56 UTC
No crash when testing with ovn22.06-22.06.0-75:

+ grep ovn-controller
openvsw+   37236  0.0  0.0 238256  6980 ?        S<sl 02:18   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach
root       37241  0.0  0.0   6412  2244 pts/0    S+   02:18   0:00 grep ovn-controller                
+ ovn-appctl vlog/set dbg                                                                             
+ ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external-ids:iface-id=sw0-port1      
+ ovn-nbctl --wait=hv sync                                                                            
+ ovn-appctl debug/pause
+ ovn-appctl -t ovn-controller debug/status                                                           
paused
+ ovn-nbctl ls-add sw0 -- lsp-add sw0 sw0-port1 -- lsp-set-addresses sw0-port1 '50:54:00:00:00:01 192.168.0.2'
+ ovn-nbctl lsp-del sw0-port1                                                                         
+ ovn-appctl debug/resume                                                                             
+ ovn-nbctl --wait=hv sync                                                                            
+ ps aux
+ grep ovn-controller
openvsw+   37236  2.0  0.0 238264  7316 ?        S<sl 02:18   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach
root       37305  0.0  0.0   6412  2332 pts/0    S+   02:18   0:00 grep ovn-controller                
+ ovn-nbctl ls-del sw0
+ ovn-nbctl --wait=hv sync                                                                            
+ ps aux
+ grep ovn-controller
openvsw+   37236  2.0  0.0 238264  7316 ?        S<sl 02:18   0:00 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-controller.log --pidfile=/run/ovn/ovn-controller.pid --detach                       
root       37309  0.0  0.0   6412  2236 pts/0    S+   02:18   0:00 grep ovn-controller                
+ check_coredump                                                                                      
+ coredumpctl list                                                                                    
No coredumps found.                                                                                   
[root@dell-per730-20 bz2132964]# rpm -qa | grep -E "openvswitch2.17|ovn22.06"                         
openvswitch2.17-2.17.0-50.el9fdp.x86_64                                                               
ovn22.06-22.06.0-75.el9fdp.x86_64                                                                     
ovn22.06-central-22.06.0-75.el9fdp.x86_64                                                             
ovn22.06-host-22.06.0-75.el9fdp.x86_64

Comment 15 errata-xmlrpc 2022-11-21 18:19:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.17 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8567

Comment 16 Ehsan Elahi 2023-01-25 18:34:59 UTC
The fix does not seem to have been backported to ovn-2021.