Created attachment 1666294
the logs

It seems that ovn-controller sometimes misses port binding notifications. We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.
ovn2.12.x86_64 0:2.12.0-32.el7fdn
openvswitch2.12.x86_64 0:2.12.0-21.el7fdn
Created attachment 1666372
Full sbdb logs (leader + followers)
Created attachment 1666376
working vs not-working ovn-controller logs (different nodes in the same run)
From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

    2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
        "OVN_Southbound",
        ["monid","OVN_Southbound"],
        [...]
        "Port_Binding": [
            {"where": [
                ["type","==","patch"],
                ["type","==","chassisredirect"],
                ["type","==","external"]
             ],
             "columns": [
                "chassis", "datapath", "encap", "gateway_chassis",
                "ha_chassis_group", "logical_port", "mac", "nat_addresses",
                "options", "parent_port", "tag", "tunnel_key", "type",
                "virtual_parent"
             ]
            }
        ],
        [...]

After this point the ovn-controller facing the issue ("server13") doesn't update the monitor condition, i.e., it doesn't send any other monitor_cond_since requests to OVSDB SB server.

From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:

    2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
        "OVN_Southbound",
        ["monid","OVN_Southbound"],
        [...]
        "Port_Binding": [
            {"columns": [
                "chassis", "datapath", "encap", "gateway_chassis",
                "ha_chassis_group", "logical_port", "mac", "nat_addresses",
                "options", "parent_port", "tag", "tunnel_key", "type",
                "virtual_parent"
             ]
            }
        ],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
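For anyone who wants to run the same comparison on their own sbdb logs, here is a minimal Python sketch of the check. It assumes the params array of the logged request has already been extracted and parsed as JSON, and that its third element is the table-to-request mapping, as the excerpts suggest. The function name and the two sample requests are illustrative only: the samples are cut down to Port_Binding with a shortened column list, and the parts elided as [...] in the logs are not reproduced.

```python
def has_port_binding_condition(params):
    """Return True if any Port_Binding monitor request carries a "where" key."""
    # Assumption: params[2] is the table -> monitor-request(s) mapping, as the
    # logged requests suggest.
    monitor_requests = params[2]
    requests = monitor_requests.get("Port_Binding", [])
    if isinstance(requests, dict):
        # Tolerate a single request object that is not wrapped in a list.
        requests = [requests]
    return any("where" in req for req in requests)


# Trimmed-down copies of the two logged requests (illustrative, not complete).
server13_params = [
    "OVN_Southbound",
    ["monid", "OVN_Southbound"],
    {"Port_Binding": [{
        "where": [["type", "==", "patch"],
                  ["type", "==", "chassisredirect"],
                  ["type", "==", "external"]],
        "columns": ["chassis", "datapath", "logical_port", "type"],
    }]},
]
server9_params = [
    "OVN_Southbound",
    ["monid", "OVN_Southbound"],
    {"Port_Binding": [{
        "columns": ["chassis", "datapath", "logical_port", "type"],
    }]},
]

print(has_port_binding_condition(server13_params))  # True: still conditional
print(has_port_binding_condition(server9_params))   # False: monitors all rows
```

With ovn-monitor-all=true, a request that still reports a condition for Port_Binding, like the one from "server13", is the symptom to look for.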
Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. start OVS with the ovsdb.db database.

2. start OVN northd and NB/SB databases. E.g.:

    /usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1

3. Attach the following gdb script to ovsdb-server SB:

    $ cat gdb-commands
    b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
    commands
    shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
    print sleep(1)
    dis
    c
    end
    c

    $ ovs-vsctl set open . external-ids:ovn-monitor-all=false
    $ ovn-sbctl destroy chassis .
    $ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
    $ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
    $ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. Then it attaches to the ovsdb-server for SB and sets a breakpoint for when it would receive the monitor_cond_since command from ovn-controller for the OVN_Southbound database. At this point, before replying to ovn-controller the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 sec making sure ovn-controller received the update from OVS, and continues.

4. Start ovn-controller:

    /usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:

    $ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
    chassis : []

Root cause: ovn-controller shouldn't call update_sb_monitors if the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613
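To make the ordering problem in the root cause easier to follow, here is a toy Python model of the race. It is only a sketch under stated assumptions and not the ovsdb-idl code: ToyIdl, simulate, the MONITORING state name and the "filtered"/"all" condition labels are made up for illustration, and the model assumes a condition change is flagged only when the new condition differs from the last one set. The guard shown is the one the root cause describes (skip update_sb_monitors while the initial request is pending); it is not claimed to be what the upstream patch does.

```python
class ToyIdl:
    """Toy stand-in for the SB IDL; not the real ovsdb-idl data structures."""

    REQUESTED = "IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED"  # state named in the report
    MONITORING = "MONITORING"  # hypothetical name for "initial reply processed"

    def __init__(self, initial_cond):
        self.state = ToyIdl.REQUESTED
        self.in_flight_cond = initial_cond  # condition in the pending request
        self.server_cond = None             # what the server currently applies
        self.wanted_cond = initial_cond     # last condition set by the client
        self.cond_changed = False

    def set_condition(self, cond):
        # Assumption: a change is only flagged when the condition differs
        # from the last one that was set.
        if cond != self.wanted_cond:
            self.wanted_cond = cond
            self.cond_changed = True

    def handle_monitor_reply(self):
        # Models the reply path from the report: the db is cleared, which
        # resets cond_changed and drops any update queued in the meantime.
        self.server_cond = self.in_flight_cond
        self.cond_changed = False
        self.state = ToyIdl.MONITORING

    def run(self):
        # Send a condition change only if one is still flagged.
        if self.state == ToyIdl.MONITORING and self.cond_changed:
            self.server_cond = self.wanted_cond
            self.cond_changed = False


def simulate(guard_in_caller):
    idl = ToyIdl(initial_cond="filtered")

    def update_sb_monitors():
        # With the guard, skip the update while the initial request is pending.
        if guard_in_caller and idl.state == ToyIdl.REQUESTED:
            return
        idl.set_condition("all")  # ovn-monitor-all=true: monitor everything

    update_sb_monitors()        # ovn-monitor-all flips before the server replies
    idl.handle_monitor_reply()  # reply to the initial monitor_cond_since
    update_sb_monitors()        # next main loop iteration
    idl.run()
    return idl.server_cond


print(simulate(guard_in_caller=False))  # filtered
print(simulate(guard_in_caller=True))   # all
```

In the unguarded run the update is flagged before the reply, wiped by the clear, and never flagged again because the wanted condition no longer changes. That matches the observed behaviour of "server13" never sending another condition update.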
Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/
Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c