+++ This bug was initially created as a clone of Bug #1808125 +++

It seems that ovn-controller sometimes misses port binding notifications. We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.

--- Additional comment from Dan Williams on 2020-02-27 21:18:14 UTC ---

ovn2.12.x86_64 0:2.12.0-32.el7fdn

--- Additional comment from Dan Williams on 2020-02-27 21:18:39 UTC ---

openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

--- Additional comment from Dumitru Ceara on 2020-02-28 10:13:23 UTC ---

--- Additional comment from Dumitru Ceara on 2020-02-28 10:52:16 UTC ---

--- Additional comment from Dumitru Ceara on 2020-02-28 12:11:30 UTC ---

From sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
  "OVN_Southbound",
  ["monid","OVN_Southbound"],
  [...]
  "Port_Binding": [
    {"where": [
       ["type","==","patch"],
       ["type","==","chassisredirect"],
       ["type","==","external"]],
     "columns": ["chassis", "datapath", "encap", "gateway_chassis", "ha_chassis_group", "logical_port", "mac", "nat_addresses", "options", "parent_port", "tag", "tunnel_key", "type", "virtual_parent"]}],
  [...]

After this point the ovn-controller facing the issue ("server13") doesn't update the monitor condition, i.e., it doesn't send any other monitor_cond_since requests to the OVSDB SB server.

From sbdb_follower2.log, where the ovn-controller that doesn't hit the issue ("server9" in the logs) is connected, we see the correct "monitor_cond_since" request, specifically with no "where" clause for Port_Binding:

2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
  "OVN_Southbound",
  ["monid","OVN_Southbound"],
  [...]
  "Port_Binding": [
    {"columns": ["chassis", "datapath", "encap", "gateway_chassis", "ha_chassis_group", "logical_port", "mac", "nat_addresses", "options", "parent_port", "tag", "tunnel_key", "type", "virtual_parent"]}],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.

--- Additional comment from Dumitru Ceara on 2020-02-28 16:01:05 UTC ---

Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. Start OVS with the ovsdb.db database.

2. Start OVN northd and the NB/SB databases, e.g.:
   /usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1

3. Attach the following gdb script to the SB ovsdb-server:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. Then it attaches to the SB ovsdb-server and sets a breakpoint for when it receives the monitor_cond_since command from ovn-controller for the OVN_Southbound database.
At this point, before replying to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller received the update from OVS, and continues.

4. Start ovn-controller:
   /usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
   $ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   chassis             : []

Root cause: ovn-controller shouldn't call update_sb_monitors if the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1],
- which calls ovsdb_idl_db_clear() [2],
- which sets table->cond_changed = false for every table monitored by the IDL [3],
so the condition change made while the request was in flight is never sent to the server.

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613

--- Additional comment from Dumitru Ceara on 2020-02-28 23:02:04 UTC ---

Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/

--- Additional comment from Dumitru Ceara on 2020-03-30 09:38:02 UTC ---

Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c
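For anyone reproducing this on other versions, the monitor condition that ovn-controller actually installed can be inspected the same way as in the sbdb log analysis above: raise the jsonrpc log level on the SB ovsdb-server and look at the last monitor_cond_since request (a Port_Binding entry that still carries a "where" clause means conditional monitoring was not disabled). A minimal sketch; the control-socket and log file paths are assumptions for a default ovn2.13 layout and may differ (older builds use /var/run/openvswitch and /var/log/openvswitch):

# Paths below are assumptions; adjust to your installation.
$ ovs-appctl -t /var/run/ovn/ovnsb_db.ctl vlog/set jsonrpc:file:dbg
$ grep monitor_cond_since /var/log/ovn/ovsdb-server-sb.log | tail -n 1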
Hi Dumitru, the db files in https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6 don't work on ovn2.13. Could you help provide db files for 2.13? Thanks
Created attachment 1677528 [details] OVS conf.db
Created attachment 1677557 [details] NB DB. NB DB taken from BZ 1808125 and fixed.
Created attachment 1677558 [details] SB DB. SB DB taken from BZ 1808125 and fixed.
/usr/share/ovn/scripts/ovn-ctl restart_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1 still failed with nbdb and sbdb attached:

Exiting ovn-northd (19184)                                 [  OK  ]
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
Waiting for OVN_Northbound to come up
2020-04-10T00:48:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-04-10T00:48:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected   [  OK  ]
Upgrading database OVN_Northbound from schema version 5.18.0 to 5.20.0
2020-04-10T00:48:50Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-nb.ovsschema: changed 2 columns in 'OVN_Northbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns
2020-04-10T00:49:20Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 19488 Alarm clock             "$@"   [FAILED]
ovsdb-tool: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
ovsdb-server: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
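One way to narrow down this kind of failure is to check what format the attached database files are actually in: ovsdb-tool can report whether a file is standalone or clustered and, with openvswitch 2.12 or newer, extract a standalone copy from a clustered database. A sketch, with paths taken from the log above; treat the exact subcommands as an assumption for the ovsdb-tool version in use:

# db-is-clustered / db-is-standalone exit 0 when the file matches that format.
$ ovsdb-tool db-is-clustered /etc/ovn/ovnsb_db.db && echo clustered
$ ovsdb-tool db-is-standalone /etc/ovn/ovnsb_db.db && echo standalone
# If clustered, a standalone copy can be extracted and used instead:
$ ovsdb-tool cluster-to-standalone /tmp/ovnsb_db_standalone.db /etc/ovn/ovnsb_db.db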
Created attachment 1677739 [details] standalone/simplified NB DB
Created attachment 1677740 [details] standalone/simplified SB DB
Hi Jianlin,

I simplified the NB/SB DBs to include only the relevant configurations. Also, they're not clustered DBs anymore, so the steps to replicate the issue are:

1. Start OVS with the attached conf.db database.

2. Start OVN northd and the NB/SB databases with the attached files, e.g.:
   /usr/share/ovn/scripts/ovn-ctl start_northd

3. Attach the following gdb script to the SB ovsdb-server:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. It then attaches to the SB ovsdb-server and sets a breakpoint for when it receives the monitor_cond_since command from ovn-controller for the OVN_Southbound database. At this point, before replying to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller received the update from OVS, and continues.

4. Start ovn-controller:
   /usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
   $ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   chassis             : []

Thanks,
Dumitru
Reproduced with the following steps. openvswitch-debuginfo and openvswitch-debugsource should be installed, and:

yum debuginfo-install glibc-2.28-101.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 libibverbs-26.0-8.el8.x86_64 libmnl-1.0.4-6.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 unbound-libs-1.7.3-10.el8.x86_64 zlib-1.2.11-13.el8.x86_64

[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-host-2.13.0-7.el8fdp.x86_64
ovn2.13-2.13.0-7.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-central-2.13.0-7.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64

1. systemctl start openvswitch
   cp conf.db /etc/openvswitch
   systemctl restart openvswitch

2. /usr/share/ovn/scripts/ovn-ctl start_northd
   cp nbdb.db /etc/ovn/ovnnb_db.db
   cp sbdb.db /etc/ovn/ovnsb_db.db
   /usr/share/ovn/scripts/ovn-ctl restart_northd

3. ovs-vsctl set open . external-ids:ovn-monitor-all=false
   ovn-sbctl destroy chassis .
   ovn-sbctl clear port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal chassis
   spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
   gdb -p $spid -x gdb-commands

4. /usr/share/ovn/scripts/ovn-ctl start_controller
   ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   chassis             : []    <=== no chassis

5. quit gdb

6. [root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   chassis             : []    <=== still no chassis

Verified on ovn2.13.0-11:

=== the gdb output
0x00007f5fad77bee8 in __GI___poll (fds=fds@entry=0x557d37c51c00, nfds=nfds@entry=5, timeout=timeout@entry=2500) at ../sysdeps/unix/sysv/linux/poll.c:29
29          return SYSCALL_CANCEL (poll, fds, nfds, timeout);
Breakpoint 1 at 0x557d370a1870: file ../ovsdb/jsonrpc-server.c, line 1276.
Breakpoint 1, ovsdb_jsonrpc_parse_monitor_request (dbmon=0x557d37caf6a0, table=table@entry=0x557d37c510f0, cond=0x557d37caf760, monitor_request=0x557d37c2dee0) at ../ovsdb/jsonrpc-server.c:1276
1276        {
$1 = 0

Program received signal SIGPIPE, Broken pipe.
0x00007f5fae31df5e in __libc_send (fd=22, buf=0x557d37cab980, len=33, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/send.c:28
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) quit
A debugging session is active.

        Inferior 1 [process 33544] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/ovsdb-server, process 33544
[Inferior 1 (process 33544) detached]

[root@hp-dl380pg8-12 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller                                    [  OK  ]
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c    <=== chassis is claimed
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c
[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-2.13.0-11.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-host-2.13.0-11.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64
ovn2.13-2.13.0-11.el8fdp.x86_64
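A small convenience when rerunning the step above: the gdb session in the transcript gets interrupted by a SIGPIPE in ovsdb-server and has to be continued or quit by hand. Telling gdb to pass SIGPIPE through avoids that; a sketch that prepends the directive to the gdb-commands file from the reproduction steps (assumes GNU sed):

# Make gdb ignore the SIGPIPE seen in the transcript above.
$ sed -i '1i handle SIGPIPE nostop noprint pass' gdb-commands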
Reproduced on ovn2.13.0-7, rhel7 version:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller                                    [  OK  ]
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []    <=== not claimed

Verified on 2.13.0-11:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller                                    [  OK  ]
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : b1ff50fd-4be8-410e-af3d-1169cf913be8
[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-central-2.13.0-11.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-devel-2.13.0-9.el7fdp.x86_64
ovn2.13-2.13.0-11.el7fdp.x86_64
ovn2.13-host-2.13.0-11.el7fdp.x86_64
openvswitch2.13-2.13.0-9.el7fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-9.el7fdp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1501