Bug 1808125
| Summary: | [OVN SCALE][OVN 2.12] ovn-controller with monitor-all misses port bindings sometimes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Dan Williams <dcbw> |
| Component: | ovn2.12 | Assignee: | Dumitru Ceara <dceara> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Jianlin Shi <jishi> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | RHEL 8.0 | CC: | ctrautma, dceara, jishi, mmichels, nusiddiq, ralongi |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1818754 | Environment: | |
| Last Closed: | 2020-04-15 14:08:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1818754 | | |
ovn2.12.x86_64 0:2.12.0-32.el7fdn
openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

Created attachment 1666372 [details]
Full sbdb logs (leader + followers)
Created attachment 1666376 [details]
working vs not-working ovn-controller logs (different nodes in the same run)
From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:
2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
"OVN_Southbound",
["monid","OVN_Southbound"],
[...]
"Port_Binding":
[
{"where":
[
["type","==","patch"],
["type","==","chassisredirect"],
["type","==","external"]
],
"columns":
[
"chassis",
"datapath",
"encap",
"gateway_chassis",
"ha_chassis_group",
"logical_port",
"mac",
"nat_addresses",
"options",
"parent_port",
"tag",
"tunnel_key",
"type",
"virtual_parent"
]
}
],
[...]
After this point, the affected ovn-controller ("server13") doesn't update the monitor condition, i.e., it doesn't send any further monitor_cond_since requests to the OVSDB SB server.
From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:
2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
"OVN_Southbound",
["monid","OVN_Southbound"],
[...]
"Port_Binding":
[
{"columns":
[
"chassis",
"datapath",
"encap",
"gateway_chassis",
"ha_chassis_group",
"logical_port",
"mac",
"nat_addresses",
"options",
"parent_port",
"tag",
"tunnel_key",
"type",
"virtual_parent"
]
}
],
This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
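The comparison between the two requests above can be automated. The following is a minimal sketch, not part of the original report: it takes the per-table monitor-requests object from a "monitor_cond_since" request and lists the tables that still carry a condition. The function name conditional_tables and the BAD/GOOD samples (Port_Binding entries abbreviated from the two logs, with a shortened column list) are illustrative assumptions, as is the choice to treat a missing "where" or a "where" of [true] as "no condition".

```python
import json


def conditional_tables(monitor_requests):
    """Map table name -> list of 'where' clauses for tables monitored conditionally.

    Assumption: a request with no "where" key, or with "where" == [true],
    is treated as unconditional (all rows monitored).
    """
    found = {}
    for table, requests in monitor_requests.items():
        # A table may map to a single request object or to a list of them.
        if isinstance(requests, dict):
            requests = [requests]
        for request in requests:
            where = request.get("where")
            if where is not None and where != [True]:
                found.setdefault(table, []).extend(where)
    return found


# Abbreviated from the request seen on the SB leader ("server13").
BAD = json.loads("""
{"Port_Binding": [{"where": [["type", "==", "patch"],
                             ["type", "==", "chassisredirect"],
                             ["type", "==", "external"]],
                   "columns": ["chassis", "logical_port", "type"]}]}
""")

# Abbreviated from the request seen on follower 2 ("server9").
GOOD = json.loads("""
{"Port_Binding": [{"columns": ["chassis", "logical_port", "type"]}]}
""")

if __name__ == "__main__":
    for name, sample in (("server13", BAD), ("server9", GOOD)):
        tables = sorted(conditional_tables(sample))
        print(name, "conditional tables:", tables)
```

With ovn-monitor-all=true, one would expect the result to be empty for every table, so a non-empty result points at the problem described here.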
Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. Start OVS with the ovsdb.db database.

2. Start OVN northd and NB/SB databases. E.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1

3. Attach the following gdb script to ovsdb-server SB:
$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c
$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. Then it attaches to the ovsdb-server for SB and sets a breakpoint for when it would receive the monitor_cond_since command from ovn-controller for the OVN_Southbound database. At this point, before replying to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 sec to make sure ovn-controller has received the update from OVS, and continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : []

Root cause:
ovn-controller shouldn't call update_sb_monitors if the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613

Fix sent upstream for review:
https://patchwork.ozlabs.org/patch/1246910/

Fix merged upstream:
https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c
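The root cause described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the OVS IDL code and not the upstream fix: ToyIdl, simulate and the state names are hypothetical, and the model assumes that a pending condition change is only sent once the IDL is in its monitoring state. It shows why a condition update made while the initial monitor_cond_since request is still in flight is lost once the reply resets cond_changed.

```python
class ToyIdl:
    """Heavily simplified stand-in for the IDL condition handling (hypothetical)."""

    def __init__(self):
        self.state = "IDLE"
        self.cond = "conditional"   # condition the client currently wants
        self.cond_changed = False   # True when a condition update is pending
        self.sent = []              # messages "sent" to the SB server

    def send_monitor_request(self):
        # Initial request carries whatever condition is set right now.
        self.sent.append(("monitor_cond_since", self.cond))
        self.state = "MONITOR_COND_SINCE_REQUESTED"

    def set_condition(self, cond):
        # Models ovn-controller updating its SB monitor conditions.
        if cond != self.cond:
            self.cond = cond
            self.cond_changed = True

    def handle_monitor_reply(self):
        # Models the reply path clearing the DB, which resets cond_changed.
        self.cond_changed = False
        self.state = "MONITORING"

    def run(self):
        # Assumption: pending changes are only sent while monitoring.
        if self.state == "MONITORING" and self.cond_changed:
            self.sent.append(("monitor_cond_change", self.cond))
            self.cond_changed = False


def simulate(update_before_reply):
    """Return the messages sent when monitor-all is enabled before/after the reply."""
    idl = ToyIdl()
    idl.send_monitor_request()
    if update_before_reply:
        idl.set_condition("all")    # ovn-monitor-all=true arrives mid-request
        idl.run()
    idl.handle_monitor_reply()
    if not update_before_reply:
        idl.set_condition("all")    # same update, but after the reply
    idl.run()
    return idl.sent


if __name__ == "__main__":
    print("update during request:", simulate(True))
    print("update after reply:   ", simulate(False))
```

In the first ordering the model never sends a second message, which matches the observation that "server13" keeps its conditional Port_Binding monitor; in the second ordering the change is sent as expected.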
Created attachment 1666294 [details]
the logs

It seems that ovn-controller sometimes misses port binding notifications. We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.