Description of problem: Seeing an assertion error in ovsdb IDL in latest OCP OVN jobs: lib/ovsdb-idl.c:4553: assertion row->new_datum != NULL failed in ovsdb_idl_txn_write__() which causes ovn-controller to crash and restart. Saw it for two different ovn-controllers in this job: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2018/pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry/1297883269903159296/artifacts/e2e-ovn-step-registry/gather-extra/pods/openshift-ovn-kubernetes_ovnkube-node-qd7wb_ovn-controller_previous.log https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2018/pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry/1297883269903159296/artifacts/e2e-ovn-step-registry/gather-extra/pods/openshift-ovn-kubernetes_ovnkube-node-x5smh_ovn-controller_previous.log The controllers seem to be fine after restart. Full logs Here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2018/pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry/1297883269903159296/artifacts/e2e-ovn-step-registry/gather-extra/pods/ https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2018/pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry/1297883269903159296 Looks like this is where it is failing: https://github.com/openvswitch/ovs/blob/b80675c6b3f4c27952446b6dc681a50d4c92a64a/lib/ovsdb-idl.c#L4400
Hi Tim, Can you please share the OVN rpm version? Would it be possible to get enable core dumps for ovn-controller (and other OVN components) for the next time this happens? That would help a lot with troubleshooting the issue. Thanks, Dumitru
According to dcbw it's probably ovn2.13-20.06.1-6.el7fdp I'm trying to launch another cluster to confirm. We dont currently get the versions or core dumps our collection mechanism (must-gather). Andrew is working on it here: https://github.com/openshift/must-gather/pull/169/files
can confirm it is ovn2.13-20.06.1-6.el7fdp.x86_64
Now seeing this in upstream OVN-k8s: https://github.com/ovn-org/ovn-kubernetes/pull/1642#issuecomment-682306702 Interestingly, even when ovn-controller restarts, it does not seem to recover: 2020-08-27T22:17:13.033Z|00366|util|EMER|lib/ovsdb-idl.c:4612: assertion row->new_datum != NULL failed in ovsdb_idl_txn_write__() 2020-08-27T22:17:15.444Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-controller.log 2020-08-27T22:17:15.445Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting... 2020-08-27T22:17:15.445Z|00003|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected 2020-08-27T22:17:15.449Z|00004|main|INFO|OVS IDL reconnected, force recompute. 2020-08-27T22:17:15.449Z|00005|main|INFO|OVNSB IDL reconnected, force recompute. However we dont see any messages here about any new ports or anything, but on the other worker node I still see activity for the next 13 minutes: 2020-08-27T22:20:18.372Z|00285|binding|INFO|Releasing lport network-policy-1432_server-l9l2h from this chassis. 2020-08-27T22:20:19.796Z|00286|binding|INFO|Claiming lport network-policy-2180_server-9hp9d for this chassis. 2020-08-27T22:20:19.797Z|00287|binding|INFO|network-policy-2180_server-9hp9d: Claiming 0a:58:0a:f4:00:62 10.244.0.98 2020-08-27T22:20:23.289Z|00288|binding|INFO|Releasing lport network-policy-2980_client-a-xpsk2 from this chassis. 2020-08-27T22:20:25.633Z|00289|binding|INFO|Claiming lport network-policy-8644_server-5qlgf for this chassis. 2020-08-27T22:20:25.635Z|00290|binding|INFO|network-policy-8644_server-5qlgf: Claiming 0a:58:0a:f4:00:63 10.244.0.99 2020-08-27T22:22:18.297Z|00291|binding|INFO|Releasing lport network-policy-7148_server-6v7s7 from this chassis. 2020-08-27T22:23:29.733Z|00292|binding|INFO|Releasing lport network-policy-2180_server-9hp9d from this chassis. 2020-08-27T22:23:39.729Z|00293|binding|INFO|Releasing lport network-policy-8644_server-5qlgf from this chassis. 2020-08-27T22:27:19.767Z|00294|binding|INFO|Claiming lport network-policy-3483_server-b6skc for this chassis. 2020-08-27T22:27:19.767Z|00295|binding|INFO|network-policy-3483_server-b6skc: Claiming 0a:58:0a:f4:00:64 10.244.0.100 2020-08-27T22:30:27.752Z|00296|binding|INFO|Releasing lport network-policy-3483_server-b6skc from this chassis. Also, ports being created on the affected node are missing flows in OVS.
Created attachment 1712959 [details] ovn-controller (ovn-20.06.2-1.fc31) core dump
(In reply to Dumitru Ceara from comment #5) > Created attachment 1712959 [details] > ovn-controller (ovn-20.06.2-1.fc31) core dump I managed to get an ovn-controller core dump when this issue happens while running ovn-k8s upstream kind tests locally. Seems to be related to the recent incremental processing work in ovn-controller: (gdb) bt #0 0x00007f9eeba36625 in raise () from /lib64/libc.so.6 #1 0x00007f9eeba1f8d9 in abort () from /lib64/libc.so.6 #2 0x0000564c5a985544 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x564c5aa65050 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd85545020) at lib/util.c:419 #3 0x0000564c5a98d325 in vlog_abort_valist (module_=<optimized out>, message=0x564c5aa65050 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd85545020) at lib/vlog.c:1249 #4 0x0000564c5a98d3ca in vlog_abort (module=module@entry=0x564c5ab1de60 <this_module>, message=message@entry=0x564c5aa65050 "%s: assertion %s failed in %s()") at lib/vlog.c:1263 #5 0x0000564c5a98525b in ovs_assert_failure (where=where@entry=0x564c5aa60863 "lib/ovsdb-idl.c:4612", function=function@entry=0x564c5aa61b70 <__func__.11198> "ovsdb_idl_txn_write__", condition=condition@entry=0x564c5aa607d9 "row->new_datum != NULL") at lib/util.c:86 #6 0x0000564c5a96bc8a in ovsdb_idl_txn_write__ (row_=0x564c5c409690, column=0x564c5aafa680 <sbrec_port_binding_columns>, datum=0x7ffd85545160, owns_datum=<optimized out>) at lib/ovsdb-idl.c:4654 #7 0x0000564c5a96dd1b in ovsdb_idl_txn_write_clone (row=<optimized out>, column=<optimized out>, datum=<optimized out>) at lib/ovsdb-idl.c:4716 #8 0x0000564c5a8ffd10 in sbrec_port_binding_set_chassis (row=row@entry=0x564c5c409690, chassis=chassis@entry=0x0) at lib/ovn-sb-idl.c:20619 #9 0x0000564c5a8b449f in release_lport (pb=0x564c5c409690, sb_readonly=<optimized out>, tracked_datapaths=0x564c5bed4d08) at controller/binding.c:893 #10 0x0000564c5a8b6b7b in release_local_binding (tracked_dp_bindings=0x564c5bed4d08, sb_readonly=<optimized out>, lbinding=0x564c5c801f60, chassis_rec=0x564c5bf18c50) at controller/binding.c:981 #11 consider_iface_release (iface_rec=0x564c5c979270, b_ctx_in=0x7ffd855452a0, b_ctx_in=0x7ffd855452a0, b_ctx_out=0x7ffd85545250, iface_id=0x564c5c41f430 "network-policy-4261_client-a-ghvnq") at controller/binding.c:1772 #12 binding_handle_ovs_interface_changes (b_ctx_in=0x7ffd855452a0, b_ctx_out=0x7ffd85545250) at controller/binding.c:1869 #13 0x0000564c5a8d351b in runtime_data_ovs_interface_handler (node=0x7ffd85546640, data=0x564c5bed4c20) at controller/ovn-controller.c:1275 #14 0x0000564c5a8ebc03 in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:306 #15 engine_run_node (recompute_allowed=<optimized out>, node=0x7ffd85546640) at lib/inc-proc-eng.c:352 #16 engine_run (recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:377 #17 0x0000564c5a8b206a in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:2403 I'll continue debugging.
A simpler way to reproduce the issue, on an OVN-only deployment: ovn-nbctl lr-add rtr ovn-nbctl lrp-add rtr rtr-ls 00:00:00:00:01:00 42.42.42.1/24 ovn-nbctl ls-add ls ovn-nbctl lsp-add ls ls-rtr ovn-nbctl lsp-set-addresses ls-rtr 00:00:00:00:01:00 ovn-nbctl lsp-set-type ls-rtr router ovn-nbctl lsp-set-options ls-rtr router-port=rtr-ls ovn-nbctl lsp-add ls vm1 ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01 ip netns add vm1 ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal ip link set vm1 netns vm1 ip netns exec vm1 ip link set vm1 address 00:00:00:00:00:01 ip netns exec vm1 ip addr add 42.42.42.2/24 dev vm1 ip netns exec vm1 ip link set vm1 up ovs-vsctl set Interface vm1 external_ids:iface-id=vm1 # Attach gdb to ovn-controller and run a script that deletes the SB port binding and OVS interface at the same time: $ cat > /tmp/test.sh <<EOF ovn-nbctl lsp-del vm1 ovs-vsctl del-port vm1 sleep 0.5 EOF $ chmod +x /tmp/test.sh $ cat > /tmp/script.gdb <<EOF shell /tmp/test.sh c EOF $ gdb -p `pidof ovn-controller` -x /tmp/script.gdb [...] Program received signal SIGABRT, Aborted. 0x00007fcd9f107377 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 python-libs-2.7.5-86.el7.x86_64 unbound-libs-1.6.6-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) bt #0 0x00007fcd9f107377 in raise () from /lib64/libc.so.6 #1 0x00007fcd9f108a68 in abort () from /lib64/libc.so.6 #2 0x0000000000517a04 in ovs_abort_valist (err_no=0, format=0x6378f8 "%s: assertion %s failed in %s()", args=0x7fff9b30ce28) at lib/util.c:419 #3 0x0000000000520516 in vlog_abort_valist (module_=0x914bc0 <this_module>, message=0x6378f8 "%s: assertion %s failed in %s()", args=0x7fff9b30ce28) at lib/vlog.c:1249 #4 0x00000000005205bc in vlog_abort (module=0x914bc0 <this_module>, message=0x6378f8 "%s: assertion %s failed in %s()") at lib/vlog.c:1263 #5 0x00000000005172aa in ovs_assert_failure (where=0x632fab "lib/ovsdb-idl.c:4612", function=0x633850 <__func__.10200> "ovsdb_idl_txn_write__", condition=0x632b36 "row->new_datum != NULL") at lib/util.c:86 #6 0x00000000004f7ba9 in ovsdb_idl_txn_write__ (row_=0x157b0c0, column=0x8f19c0 <sbrec_port_binding_columns>, datum=0x7fff9b30cff0, owns_datum=false) at lib/ovsdb-idl.c:4612 #7 0x00000000004f7f97 in ovsdb_idl_txn_write_clone (row=0x157b0c0, column=0x8f19c0 <sbrec_port_binding_columns>, datum=0x7fff9b30cff0) at lib/ovsdb-idl.c:4717 #8 0x0000000000462922 in sbrec_port_binding_set_chassis (row=0x157b0c0, chassis=0x0) at lib/ovn-sb-idl.c:21551 #9 0x0000000000409400 in release_lport (pb=0x157b0c0, sb_readonly=false, tracked_datapaths=0x1507718) at controller/binding.c:893 #10 0x000000000040972d in release_local_binding (chassis_rec=0x157cce0, lbinding=0x1528070, sb_readonly=false, tracked_dp_bindings=0x1507718) at controller/binding.c:981 #11 0x000000000040b2d8 in consider_iface_release (iface_rec=0x1571d80, iface_id=0x159dc90 "vm1", b_ctx_in=0x7fff9b30d1d0, b_ctx_out=0x7fff9b30d180) at controller/binding.c:1772 #12 0x000000000040b502 in binding_handle_ovs_interface_changes (b_ctx_in=0x7fff9b30d1d0, b_ctx_out=0x7fff9b30d180) at controller/binding.c:1869 ---Type <return> to continue, or q <return> to quit--- #13 0x000000000042cd3d in runtime_data_ovs_interface_handler (node=0x7fff9b324aa0, data=0x1507630) at controller/ovn-controller.c:1311 #14 0x000000000044b368 in engine_compute (recompute_allowed=<optimized out>, node=<optimized out>) at lib/inc-proc-eng.c:306 #15 engine_run_node (recompute_allowed=<optimized out>, node=0x7fff9b324aa0) at lib/inc-proc-eng.c:352 #16 engine_run (recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:377 #17 0x000000000042fd76 in main (argc=10, argv=0x7fff9b326e68) at controller/ovn-controller.c:2449
reproduced on ovn2.13-20.06.1-6.el8fdp.x86_64: [root@dell-per740-12 bz1871961]# rpm -qa | grep -E "openvswitch|ovn" │ openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch │ ovn2.13-debuginfo-20.06.1-6.el8fdp.x86_64 │ ovn2.13-central-debuginfo-20.06.1-6.el8fdp.x86_64 │ openvswitch2.13-2.13.0-58.el8fdp.x86_64 │ ovn2.13-20.06.1-6.el8fdp.x86_64 │ ovn2.13-host-20.06.1-6.el8fdp.x86_64 │ ovn2.13-host-debuginfo-20.06.1-6.el8fdp.x86_64 │ openvswitch2.13-debugsource-2.13.0-58.el8fdp.x86_64 │ ovn2.13-central-20.06.1-6.el8fdp.x86_64 │ openvswitch2.13-debuginfo-2.13.0-58.el8fdp.x86_64 [root@dell-per740-12 bz1871961]# cat rep.sh systemctl start openvswitch systemctl start ovn-northd ovn-nbctl set-connection ptcp:6641 ovn-sbctl set-connection ptcp:6642 ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.50.26:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.50.26 systemctl restart ovn-controller ovs-vsctl set bridge br-int other_config:hwaddr="3e:5d:ff:25:10:f8" ovn-nbctl lr-add rtr ovn-nbctl lrp-add rtr rtr-ls 00:00:00:00:01:00 42.42.42.1/24 ovn-nbctl ls-add ls ovn-nbctl lsp-add ls ls-rtr ovn-nbctl lsp-set-addresses ls-rtr 00:00:00:00:01:00 ovn-nbctl lsp-set-type ls-rtr router ovn-nbctl lsp-set-options ls-rtr router-port=rtr-ls ovn-nbctl lsp-add ls vm1 ovn-nbctl lsp-set-addresses vm1 00:00:00:00:00:01 ip netns add vm1 ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal ip link set vm1 netns vm1 ip netns exec vm1 ip link set vm1 address 00:00:00:00:00:01 ip netns exec vm1 ip addr add 42.42.42.2/24 dev vm1 ip netns exec vm1 ip link set vm1 up ovs-vsctl set Interface vm1 external_ids:iface-id=vm1 GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8 Copyright (C) 2018 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word". Attaching to process 46581 [New LWP 46582] [New LWP 46583] [New LWP 46584] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". 0x00007f3cf8cc3f21 in __GI___poll (fds=fds@entry=0x55aec713dbf0, nfds=nfds@entry=6, timeout=timeout@entry=4992) at ../sysdeps/unix/sysv/linux/poll.c:29 29 return SYSCALL_CANCEL (poll, fds, nfds, timeout); Thread 1 "ovn-controller" received signal SIGABRT, Aborted. __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 return ret; (gdb) <=== abort Verified on ovn20.06.2-3: [root@dell-per740-12 bz1871961]# rpm -qa | grep -E "openvswitch|ovn" openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch ovn2.13-20.06.2-3.el8fdp.x86_64 ovn2.13-central-debuginfo-20.06.2-3.el8fdp.x86_64 openvswitch2.13-2.13.0-58.el8fdp.x86_64 openvswitch2.13-debugsource-2.13.0-58.el8fdp.x86_64 ovn2.13-central-20.06.2-3.el8fdp.x86_64 ovn2.13-debuginfo-20.06.2-3.el8fdp.x86_64 ovn2.13-host-debuginfo-20.06.2-3.el8fdp.x86_64 openvswitch2.13-debuginfo-2.13.0-58.el8fdp.x86_64 ovn2.13-host-20.06.2-3.el8fdp.x86_64 [root@dell-per740-12 bz1871961]# gdb -p `pidof ovn-controller` -x /tmp/script.gdb GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8 Copyright (C) 2018 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word". Attaching to process 48155 [New LWP 48156] [New LWP 48157] [New LWP 48159] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". 0x00007f645a20cf21 in __GI___poll (fds=fds@entry=0x5631b0a13c90, nfds=nfds@entry=6, timeout=timeout@entry=4999) at ../sysdeps/unix/sysv/linux/poll.c:29 29 return SYSCALL_CANCEL (poll, fds, nfds, timeout); ^C Thread 1 "ovn-controller" received signal SIGINT, Interrupt. 0x00007f645a20cf21 in __GI___poll (fds=fds@entry=0x5631b0a14e70, nfds=nfds@entry=6, timeout=timeout@entry=5000) at ../sysdeps/unix/sysv/linux/poll.c:29 29 return SYSCALL_CANCEL (poll, fds, nfds, timeout); <=== not abort
also verified on rhel7 version: [root@dell-per740-12 bz1871961]# rpm -qa | grep -E "openvswitch|ovn" openvswitch2.13-2.13.0-48.el7fdp.x86_64 ovn2.13-20.06.2-3.el7fdp.x86_64 openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch ovn2.13-central-20.06.2-3.el7fdp.x86_64 ovn2.13-debuginfo-20.06.2-3.el7fdp.x86_64 ovn2.13-host-20.06.2-3.el7fdp.x86_64 Loaded symbols for /lib64/libnss_files.so.2 0x00007fa2f3356c3d in poll () at ../sysdeps/unix/syscall-template.S:81 81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) <=== not abort
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3769