Bug 1808125 - [OVN SCALE][OVN 2.12] ovn-controller with monitor-all misses port bindings sometimes
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.12
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dumitru Ceara
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks: 1818754
 
Reported: 2020-02-27 21:16 UTC by Dan Williams
Modified: 2020-04-15 14:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1818754
Environment:
Last Closed: 2020-04-15 14:08:57 UTC
Target Upstream Version:
Embargoed:


Attachments
the logs (500.95 KB, application/x-bzip)
2020-02-27 21:16 UTC, Dan Williams
Full sbdb logs (leader + followers) (5.71 MB, application/gzip)
2020-02-28 10:13 UTC, Dumitru Ceara
working vs not-working ovn-controller logs (different nodes in the same run) (5.16 MB, application/gzip)
2020-02-28 10:52 UTC, Dumitru Ceara

Description Dan Williams 2020-02-27 21:16:44 UTC
Created attachment 1666294
the logs

It seems that ovn-controller sometimes misses port binding notifications.

We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.

Comment 1 Dan Williams 2020-02-27 21:18:14 UTC
ovn2.12.x86_64 0:2.12.0-32.el7fdn

Comment 2 Dan Williams 2020-02-27 21:18:39 UTC
openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

Comment 3 Dumitru Ceara 2020-02-28 10:13:23 UTC
Created attachment 1666372
Full sbdb logs (leader + followers)

Comment 4 Dumitru Ceara 2020-02-28 10:52:16 UTC
Created attachment 1666376
working vs not-working ovn-controller logs (different nodes in the same run)

Comment 5 Dumitru Ceara 2020-02-28 12:11:30 UTC
From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"where":
                [
                    ["type","==","patch"],
                    ["type","==","chassisredirect"],
                    ["type","==","external"]
                ],
            "columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],
        [...]
After this point the ovn-controller facing the issue ("server13") never updates the monitor condition, i.e., it doesn't send any further monitor_cond_since requests to the OVSDB SB server.

From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:

2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
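
To make the difference between the two requests easier to spot, here is a minimal Python sketch. It assumes the per-table monitor requests have already been extracted from the sbdb logs into a dict; the log excerpts above elide parts of the params with [...], so the trimmed sample data and the helper names (is_conditional, tables_with_conditions) are hypothetical. It also assumes that a missing "where", an empty "where", or [true] all mean "monitor every row".

```python
import json


def is_conditional(request):
    """True if this monitor request restricts rows with a "where" clause."""
    where = request.get("where")
    # Assumption: a missing/empty clause or [true] means "all rows".
    return bool(where) and where != [True]


def tables_with_conditions(monitor_requests):
    """Return the sorted names of tables that are monitored conditionally."""
    return sorted(
        table
        for table, requests in monitor_requests.items()
        if any(is_conditional(req) for req in requests)
    )


# Hypothetical, trimmed-down versions of the two Port_Binding requests.
server13 = json.loads("""
{"Port_Binding": [{"where": [["type", "==", "patch"],
                             ["type", "==", "chassisredirect"],
                             ["type", "==", "external"]],
                   "columns": ["chassis", "logical_port", "type"]}]}
""")
server9 = json.loads("""
{"Port_Binding": [{"columns": ["chassis", "logical_port", "type"]}]}
""")

print(tables_with_conditions(server13))  # the failing node
print(tables_with_conditions(server9))   # the working node
```

With ovn-monitor-all=true the list would be expected to be empty for every ovn-controller; a non-empty result for server13 is the symptom described above.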

Comment 6 Dumitru Ceara 2020-02-28 16:01:05 UTC
Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. Start OVS with the ovsdb.db database.
2. Start OVN northd and the NB/SB databases, e.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all in the OVS database and removes the relevant chassis and port binding records from the SB DB. It then attaches gdb to the SB ovsdb-server and sets a breakpoint that triggers when the server receives the monitor_cond_since request from ovn-controller for the OVN_Southbound database. At that point, before the server replies to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller has received the update from OVS, and continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []
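
For repeated runs, the check in step 5 could be scripted. The following is a minimal Python sketch that only parses text in the format shown above; the helper name chassis_is_unset is hypothetical, and the UUID in the second sample is made up for illustration.

```python
def chassis_is_unset(sbctl_output):
    """Return True if the 'chassis' column in the output is the empty set []."""
    for line in sbctl_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "chassis":
            return value.strip() == "[]"
    raise ValueError("no 'chassis' column found in output")


# Output copied from step 5 (port binding not claimed).
unclaimed = "chassis             : []"
# Made-up example of a claimed port binding (the UUID is not from the logs).
claimed = "chassis             : 0f6b1c9e-3c2a-4f0e-9d57-2b8f5a1c7e11"

print(chassis_is_unset(unclaimed))
print(chassis_is_unset(claimed))
```

The function takes the command output as a string, so it can be fed from whatever mechanism is used to run the ovn-sbctl command in step 5.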

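Root cause:
ovn-controller shouldn't call update_sb_monitors while the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

As a result, the condition change requested by update_sb_monitors is discarded before it is ever sent to the server, which matches the stale "where" clause for Port_Binding seen in comment 5.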

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613
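
The ordering problem can be illustrated with a toy model. This is a minimal Python sketch, not the real IDL code: the class ToyIdl, its methods, and the "condition update" message label are all hypothetical, and only the cond_changed flag and the "requested" state mirror names from the analysis above.

```python
class ToyIdl:
    """Toy model of an IDL that tracks a pending monitor condition change."""

    def __init__(self):
        # Initial monitor request sent, reply not yet received.
        self.state = "MONITOR_COND_SINCE_REQUESTED"
        self.cond_changed = False
        self.sent = []

    def set_condition(self):
        """The client changes a monitor condition (as update_sb_monitors would)."""
        self.cond_changed = True

    def on_monitor_reply(self):
        """The initial reply arrives; clearing the DB also resets cond_changed."""
        self.cond_changed = False
        self.state = "MONITORING"

    def run(self):
        """Send a condition update only if one is still pending."""
        if self.state == "MONITORING" and self.cond_changed:
            self.sent.append("condition update")
            self.cond_changed = False


# Racy ordering: the condition changes before the initial reply arrives.
racy = ToyIdl()
racy.set_condition()
racy.on_monitor_reply()
racy.run()

# Safe ordering: the condition changes only after the reply was processed.
safe = ToyIdl()
safe.on_monitor_reply()
safe.set_condition()
safe.run()

print(racy.sent)
print(safe.sent)
```

In the racy ordering nothing is ever sent, so the server keeps using the original conditional request; in the safe ordering the update goes out as expected.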

Comment 7 Dumitru Ceara 2020-02-28 23:02:04 UTC
Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/

Comment 8 Dumitru Ceara 2020-03-30 09:38:02 UTC
Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c

