Bug 1936328

Summary: ovn-controller crashes when a container port is changed to normal port and then deleted
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Numan Siddique <nusiddiq>
Component: ovn2.13Assignee: Numan Siddique <nusiddiq>
Status: CLOSED ERRATA QA Contact: Ehsan Elahi <eelahi>
Severity: urgent Docs Contact:
Priority: urgent    
Version: FDP 20.HCC: ctrautma, darya, dcbw, dsedgmen, jishi, ovnteam, ralongi, rkhan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovn2.13-20.12.0-99.fdp8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-20 19:28:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Numan Siddique 2021-03-08 08:25:59 UTC
Description of problem:


Steps to reproduce
-----

ovn-nbctl ls-add ls
ovn-nbctl lsp-add ls vm1
ovn-nbctl lsp-add ls vm-cont vm1 1
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1
ovn-appctl -t ovn-controller debug/pause
ovn-nbctl clear logical_switch_port vm-cont parent_name
ovs-vsctl set Interface vm1 external_ids:iface-id=foo
ovn-nbctl lsp-del vm-cont
ovn-appctl -t ovn-controller debug/resume


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jianlin Shi 2021-03-08 10:05:34 UTC
tested with following script:

enable_coredump()                  
{                                                    
        ulimit -c unlimited             
        ulimit -s unlimited                            
        sysctl -w fs.suid_dumpable=2                 
        if ! sysctl kernel.core_pattern | grep systemd-coredump
        then                              
                sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"
        fi
        rm -rf /var/lib/systemd/coredump/*
        rm -rf /run/log/journal/*
        rm -rf /var/log/journal/*
        systemctl restart systemd-journald
}                                            
                                              
enable_coredump
                                    
for i in {1..10}                     
do                                   
systemctl start openvswitch  
systemctl start ovn-northd  
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.169.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.169.25
systemctl restart ovn-controller     
systemctl status ovn-controller
ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal
    
ovn-nbctl ls-add ls
ovn-nbctl lsp-add ls vm1
ovn-nbctl lsp-add ls vm-cont vm1 1
ovs-vsctl set Interface vm1 external_ids:iface-id=vm1
ovn-appctl -t ovn-controller debug/pause
ovn-nbctl clear logical_switch_port vm-cont parent_name
ovs-vsctl set Interface vm1 external_ids:iface-id=foo
ovn-nbctl lsp-del vm-cont
ovn-appctl -t ovn-controller debug/resume
if coredumpctl list
then
        break
fi 

systemctl stop ovn-controller &>/dev/null
        systemctl stop ovn-northd &>/dev/null
        systemctl stop openvswitch &>/dev/null
        sleep 1
        rm -rf /etc/openvswitch/*.db
        rm -rf /etc/openvswitch/*.pem
        rm -rf /var/lib/openvswitch/*
        rm -rf /var/lib/ovn/*
        rm -rf /etc/ovn/*.db
        rm -rf /etc/ovn/*.pem
        # clean up log
        rm -rf /var/log/ovn/*
        rm -rf /var/log/openvswitch/*
        netns_clean.sh
        sync
done    
        
echo $i

reproduced on 20.12.0-24;

[root@wsfd-advnetlab16 bz1936328]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
ovn2.13-20.12.0-24.el8fdp.x86_64
ovn2.13-host-20.12.0-24.el8fdp.x86_64
openvswitch2.13-2.13.0-95.el8fdp.x86_64
python3-openvswitch2.13-2.13.0-95.el8fdp.x86_64
ovn2.13-central-20.12.0-24.el8fdp.x86_64

+ ovn-appctl -t ovn-controller debug/resume
+ coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Mon 2021-03-08 05:04:22 EST  123461   992   989   6 present   /usr/bin/ovn-controller
+ break
+ echo 2

[root@wsfd-advnetlab16 bz1936328]# coredumpctl info                                                                                                                                                [63/1808]
           PID: 123461 (ovn-controller)                           
           UID: 992 (openvswitch)                                   
           GID: 989 (openvswitch)                                   
        Signal: 6 (ABRT)                                      
     Timestamp: Mon 2021-03-08 05:04:22 EST (18s ago)
  Command Line: ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info >
    Executable: /usr/bin/ovn-controller                                                               
 Control Group: /system.slice/ovn-controller.service              
          Unit: ovn-controller.service                                  
         Slice: system.slice                                             
       Boot ID: 1a50ae1ea3394ba7a53685e24a1bc9b4                         
    Machine ID: 0350fa343ed14eea8d477b906349f017                     
      Hostname: wsfd-advnetlab16.anl.lab.eng.bos.redhat.com
       Storage: /var/lib/systemd/coredump/core.ovn-controller.992.1a50ae1ea3394ba7a53685e24a1bc9b4.12>
       Message: Process 123461 (ovn-controller) of user 992 dumped core.                              
                                                                        
                Stack trace of thread 123461:                      
                #0  0x00007f52086867ff raise (libc.so.6)                      
                #1  0x00007f5208670c35 abort (libc.so.6)                 
                #2  0x0000563d5a2b59a4 ovs_abort_valist (ovn-controller) 
                #3  0x0000563d5a2bd794 vlog_abort_valist (ovn-controller)
                #4  0x0000563d5a2bd83a vlog_abort (ovn-controller)       
                #5  0x0000563d5a2b56bb ovs_assert_failure (ovn-controller)
                #6  0x0000563d5a29bc0a ovsdb_idl_txn_write__ (ovn-controller)
                #7  0x0000563d5a22c9e0 sbrec_port_binding_set_chassis (ovn-controller)
                #8  0x0000563d5a1cfe67 release_lport (ovn-controller)                 
                #9  0x0000563d5a1d07db release_local_binding_children (ovn-controller)
                #10 0x0000563d5a1d2904 binding_handle_ovs_interface_changes (ovn-controller)
                #11 0x0000563d5a1f7a0b runtime_data_ovs_interface_handler (ovn-controller)  
                #12 0x0000563d5a2122b3 engine_run (ovn-controller)                        
                #13 0x0000563d5a1cd50b main (ovn-controller)      
                #14 0x00007f52086727b3 __libc_start_main (libc.so.6)
                #15 0x0000563d5a1cebfe _start (ovn-controller)
                
                Stack trace of thread 123465:
                #0  0x00007f5208740ca1 __poll (libc.so.6)
                #1  0x0000563d5a2b0de5 time_poll (ovn-controller)
                #2  0x0000563d5a2a63fc poll_block (ovn-controller)
                #3  0x0000563d5a2a5578 stopwatch_thread (ovn-controller)
                #4  0x0000563d5a28f3e3 ovsthread_wrapper (ovn-controller)
                #5  0x00007f52092d914a start_thread (libpthread.so.0)
                #6  0x00007f520874bf23 __clone (libc.so.6)
                
                Stack trace of thread 123463:
                #0  0x00007f5208740ca1 __poll (libc.so.6)
                #1  0x0000563d5a2b0de5 time_poll (ovn-controller)
                #2  0x0000563d5a2a63fc poll_block (ovn-controller)
                #3  0x0000563d5a28cdda ovsrcu_postpone_thread (ovn-controller)
                #4  0x0000563d5a28f3e3 ovsthread_wrapper (ovn-controller)
                #5  0x00007f52092d914a start_thread (libpthread.so.0)
                #6  0x00007f520874bf23 __clone (libc.so.6)
                
                Stack trace of thread 123462:
                #0  0x00007f5208740ca1 __poll (libc.so.6)
                #1  0x0000563d5a2b0de5 time_poll (ovn-controller)
                #2  0x0000563d5a2a63fc poll_block (ovn-controller)
                #3  0x0000563d5a1eed16 pinctrl_handler (ovn-controller)
                #4  0x0000563d5a28f3e3 ovsthread_wrapper (ovn-controller)
                #5  0x00007f52092d914a start_thread (libpthread.so.0)
                #6  0x00007f520874bf23 __clone (libc.so.6)

Comment 3 Dan Williams 2021-03-30 20:54:30 UTC
Updated patch on review: http://patchwork.ozlabs.org/project/ovn/patch/20210329132159.2005894-1-numans@ovn.org/

Comment 4 Dan Williams 2021-04-06 01:19:46 UTC
Patch committed to upstream OVN, now waiting on downstream backport.

Comment 6 Dan Williams 2021-04-13 20:45:39 UTC
Fixed in ovn2.13-20.12.0-99.fdp8

Comment 9 Ehsan Elahi 2021-04-29 08:56:08 UTC
Tested on 3 versions:  ovn2.13-20.12.0-97.fdp8, ovn2.13-20.12.0-101.fdp8 and ovn2.13-20.12.0-109.fdp. 
ovn-controller crashes in version earlier than ovn2.13-20.12.0-99.fdp8 but works fine on later versions. Here's how the bug reproduced:

enable_coredump()                  
{                                                    
        ulimit -c unlimited             
        ulimit -s unlimited                            
        sysctl -w fs.suid_dumpable=2                 
        if ! sysctl kernel.core_pattern | grep systemd-coredump
        then                              
                sysctl -w kernel.core_pattern="|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e"
        fi
        rm -rf /var/lib/systemd/coredump/*
        rm -rf /run/log/journal/*
        rm -rf /var/log/journal/*
        systemctl restart systemd-journald
}  
enable_coredump
systemctl start openvswitch
systemctl start ovn-northd
ovn-sbctl set-connection ptcp:6642
ovn-nbctl set-connection ptcp:6641
sleep 5
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:1.1.169.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=1.1.169.25
systemctl start ovn-controller
sleep 20

systemctl status openvswitch
systemctl status ovn-northd
systemctl status ovn-controller

ovs-vsctl add-port br-int iface1 -- set interface iface1 type=internal
ovn-nbctl ls-add ls
ovn-nbctl lsp-add ls lsp1
ovn-nbctl lsp-add ls lsp1c lsp1 1
ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1
ovn-appctl -t ovn-controller debug/pause

# changed child port to normal
ovn-nbctl clear logical_switch_port lsp1c parent_name

#bind to an interface
ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1c

# delete the same port
ovn-nbctl lsp-del lsp1c
ovn-appctl -t ovn-controller debug/resume

if coredumpctl list
then
     break
fi 
#check if there's any error in the ovn-controller.log
cat /var/log/ovn/ovn-controller.log|grep ERR"

# cleanup everything
systemctl stop ovn-controller &>/dev/null
systemctl stop ovn-northd &>/dev/null
systemctl stop openvswitch &>/dev/null
sleep 5
rm -rf /etc/openvswitch/*.db
rm -rf /etc/openvswitch/*.pem
rm -rf /var/lib/openvswitch/*
rm -rf /var/lib/ovn/*
rm -rf /etc/ovn/*.db
rm -rf /etc/ovn/*.pem
# clean up log
rm -rf /var/log/ovn/*
rm -rf /var/log/openvswitch/*

Comment 10 Ehsan Elahi 2021-04-29 18:44:37 UTC
Tested and verified on following versions:
ovn2.13-20.12.0-101.el8fdp
ovn2.13-20.12.0-104.el8fdp
ovn2.13-20.12.0-109.el8fdp
ovn-2021-21.03.0-21.el8fdp
ovn2.13-20.12.0-104.el7fdp

Automated as per comment 9. Test result as below:

:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn_setup hv1' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovs-vsctl add-port br-int iface1 -- set interface iface1 type=internal'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovs-vsctl add-port br-int iface1 -- set interface iface1 type=internal' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovs-vsctl add-port br-int iface2 -- set interface iface2 type=internal'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovs-vsctl add-port br-int iface2 -- set interface iface2 type=internal' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl ls-add ls'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-nbctl ls-add ls' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl lsp-add ls lsp1'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-nbctl lsp-add ls lsp1' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl lsp-add ls lsp1c lsp1 1'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-nbctl lsp-add ls lsp1c lsp1 1' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-appctl -t ovn-controller debug/pause'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-appctl -t ovn-controller debug/pause' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl clear logical_switch_port lsp1c parent_name'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-nbctl clear logical_switch_port lsp1c parent_name' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1c'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovs-vsctl set Interface iface1 external_ids:iface-id=lsp1c' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-nbctl lsp-del lsp1c'
:: [ 09:04:53 ] :: [   PASS   ] :: Command 'ovn-nbctl lsp-del lsp1c' (Expected 0, got 0)
:: [ 09:04:53 ] :: [  BEGIN   ] :: Running 'ovn-appctl -t ovn-controller debug/resume'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'ovn-appctl -t ovn-controller debug/resume' (Expected 0, got 0)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "Duplicate IP set on switch" /var/log/ovn/ovn-northd.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "Duplicate IP set on switch" /var/log/ovn/ovn-northd.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "ovsdb_idl.*transaction error" /var/log/ovn/ovn-northd.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "ovsdb_idl.*transaction error" /var/log/ovn/ovn-northd.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "ERR.*crash" /var/log/ovn/ovn-northd.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "ERR.*crash" /var/log/ovn/ovn-northd.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "OpenFlow error" /var/log/ovn/ovn-controller.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "OpenFlow error" /var/log/ovn/ovn-controller.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "service check: Unsupported protocol" /var/log/ovn/ovn-controller.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "service check: Unsupported protocol" /var/log/ovn/ovn-controller.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "Failed to acquire udpif_key corresponding to unexpected flow" /var/log/openvswitch/ovs-vswitchd.log'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "Failed to acquire udpif_key corresponding to unexpected flow" /var/log/openvswitch/ovs-vswitchd.log' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'grep -E -i "integrity violation" /var/log/ovn/ovsdb-server*'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'grep -E -i "integrity violation" /var/log/ovn/ovsdb-server*' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'cat /var/log/ovn/ovn-controller.log|grep ERR'
:: [ 09:04:54 ] :: [   PASS   ] :: Command 'cat /var/log/ovn/ovn-controller.log|grep ERR' (Expected 1, got 1)
:: [ 09:04:54 ] :: [  BEGIN   ] :: Running 'ovn_cleanup'
ovn_cleanup ...
ovn_cleanup ... end
:: [ 09:04:59 ] :: [   PASS   ] :: Command 'ovn_cleanup' (Expected 0, got 0)
:: [ 09:04:59 ] :: [  BEGIN   ] :: Running 'coredumpctl list'
No coredumps found.
:: [ 09:04:59 ] :: [   PASS   ] :: Command 'coredumpctl list' (Expected 1, got 1)
:: [ 09:04:59 ] :: [  BEGIN   ] :: Running 'coredumpctl list'
No coredumps found.
:: [ 09:04:59 ] :: [   PASS   ] :: Command 'coredumpctl list' (Expected 0-255, got 1)
fs.suid_dumpable = 0

Comment 12 errata-xmlrpc 2021-05-20 19:28:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2080

Comment 15 Red Hat Bugzilla 2023-09-15 01:02:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days