
Bug 1808125

Summary: [OVN SCALE][OVN 2.12] ovn-controller with monitor-all misses port bindings sometimes
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Dan Williams <dcbw>
Component: ovn2.12
Assignee: Dumitru Ceara <dceara>
Status: CLOSED NEXTRELEASE
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: high
Version: RHEL 8.0
CC: ctrautma, dceara, jishi, mmichels, nusiddiq, ralongi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1818754 (view as bug list)
Environment:
Last Closed: 2020-04-15 14:08:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1818754
Attachments (description, flags):
- the logs (flags: none)
- Full sbdb logs (leader + followers) (flags: none)
- working vs not-working ovn-controller logs (different nodes in the same run) (flags: none)

Description Dan Williams 2020-02-27 21:16:44 UTC
Created attachment 1666294 [details]
the logs

It seems that ovn-controller sometimes misses port binding notifications.

We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.

Comment 1 Dan Williams 2020-02-27 21:18:14 UTC
ovn2.12.x86_64 0:2.12.0-32.el7fdn

Comment 2 Dan Williams 2020-02-27 21:18:39 UTC
openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

Comment 3 Dumitru Ceara 2020-02-28 10:13:23 UTC
Created attachment 1666372 [details]
Full sbdb logs (leader + followers)

Comment 4 Dumitru Ceara 2020-02-28 10:52:16 UTC
Created attachment 1666376 [details]
working vs not-working ovn-controller logs (different nodes in the same run)

Comment 5 Dumitru Ceara 2020-02-28 12:11:30 UTC
From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"where":
                [
                    ["type","==","patch"],
                    ["type","==","chassisredirect"],
                    ["type","==","external"]
                ],
            "columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],
        [...]
After this point, the ovn-controller facing the issue ("server13") never updates the monitor condition, i.e., it doesn't send any other monitor_cond_since requests to the OVSDB SB server.

From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:

2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
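
The difference between the two requests can also be checked mechanically. The Python sketch below is illustrative only: the helper name is_conditional and the two sample objects are hypothetical, abbreviated from the log excerpts above (column lists shortened), and it assumes that a missing "where" clause, or one equal to [true], means the table is monitored unconditionally.

```python
import json


def is_conditional(monitor_requests, table):
    """Return True if any monitor request for `table` has a restricting "where".

    Assumption: an absent "where", or one equal to [true], selects all rows.
    """
    for request in monitor_requests.get(table, []):
        where = request.get("where")
        if where is not None and where != [True]:
            return True
    return False


# Abbreviated from the server13 request (the node that misses port bindings).
bad = json.loads("""
{"Port_Binding": [{"where": [["type", "==", "patch"],
                             ["type", "==", "chassisredirect"],
                             ["type", "==", "external"]],
                   "columns": ["chassis", "logical_port", "type"]}]}
""")

# Abbreviated from the server9 request (the node that works).
good = json.loads("""
{"Port_Binding": [{"columns": ["chassis", "logical_port", "type"]}]}
""")

print(is_conditional(bad, "Port_Binding"))   # True
print(is_conditional(good, "Port_Binding"))  # False
```

With ovn-monitor-all=true, a Port_Binding entry that still reports True here would point at the problem described in this bug.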

Comment 6 Dumitru Ceara 2020-02-28 16:01:05 UTC
Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. Start OVS with the ovsdb.db database.
2. Start OVN northd and the NB/SB databases, e.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

These commands reset ovn-monitor-all to false in the OVS database and remove the relevant chassis and port binding from the SB DB. gdb then attaches to the ovsdb-server for SB and sets a breakpoint that triggers when the server receives the monitor_cond_since request from ovn-controller for the OVN_Southbound database. At that point, before the server replies to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller has received the update from OVS, and continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []
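
When repeating the check across several runs, the output above can be parsed with a small helper. This is only a sketch: the function name chassis_of is hypothetical, the UUID in the second sample is made up for illustration, and it assumes the "column : value" layout shown above.

```python
def chassis_of(sbctl_output):
    """Extract the chassis value from `ovn-sbctl --columns chassis list port_binding` output.

    Returns None when the column is empty ("[]"), i.e. the port binding
    has not been claimed by any chassis.
    """
    for line in sbctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "chassis":
            value = value.strip()
            return None if value == "[]" else value
    raise ValueError("no chassis column in output")


unclaimed = "chassis             : []"
# Hypothetical UUID, used only to show the claimed case.
claimed = "chassis             : 11111111-2222-3333-4444-555555555555"

print(chassis_of(unclaimed))  # None
print(chassis_of(claimed))    # 11111111-2222-3333-4444-555555555555
```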

Root cause:
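ovn-controller shouldn't call update_sb_monitors while the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

A condition change made while the request is still in flight is therefore forgotten and never sent to ovsdb-server, which matches the absence of further requests from "server13" in the logs.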

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613
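
The ordering problem can be illustrated with a toy model. The Python sketch below is not the OVS IDL code: the class ToyIdl, its fields, and the string values are all hypothetical, and it assumes that a condition change is only sent to the server while the cond_changed flag is still set.

```python
class ToyIdl:
    """Heavily simplified stand-in for the SB IDL; not the real OVS structures."""

    def __init__(self):
        self.state = "MONITOR_COND_SINCE_REQUESTED"
        self.sent_where = "conditional"    # what the in-flight request asked for
        self.wanted_where = "conditional"  # what ovn-controller currently wants
        self.cond_changed = False

    def set_condition(self, where):
        # Models update_sb_monitors changing the monitor condition.
        self.wanted_where = where
        self.cond_changed = True

    def handle_monitor_reply(self):
        # Models the reply path clearing the database, which resets cond_changed.
        self.cond_changed = False
        self.state = "MONITORING"

    def run(self):
        # Assumption: a change is only sent if it is still flagged as pending.
        if self.state == "MONITORING" and self.cond_changed:
            self.sent_where = self.wanted_where
            self.cond_changed = False


def simulate(update_while_request_in_flight):
    idl = ToyIdl()
    if update_while_request_in_flight:
        idl.set_condition("all rows")  # ovn-monitor-all=true arrives too early
        idl.handle_monitor_reply()
    else:
        idl.handle_monitor_reply()
        idl.set_condition("all rows")  # same update, after the reply
    idl.run()
    return idl.sent_where


print(simulate(True))   # conditional (the update is lost)
print(simulate(False))  # all rows
```

In the first case the server keeps the original conditional monitor, so the node only receives Port_Binding rows matching the stale "where" clause.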

Comment 7 Dumitru Ceara 2020-02-28 23:02:04 UTC
Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/

Comment 8 Dumitru Ceara 2020-03-30 09:38:02 UTC
Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c