Bug 1814100

Summary: [scale] enable monitor-all to reduce load on southbound database
Product: OpenShift Container Platform Reporter: Dan Williams <dcbw>
Component: Networking Assignee: Ben Bennett <bbennett>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bbennett, rkhan, zzhao
Version: 4.5   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1814099 Environment:
Last Closed: 2020-08-04 18:05:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1814099    

Description Dan Williams 2020-03-17 03:00:10 UTC
+++ This bug was initially created as a clone of Bug #1814099 +++

+++ This bug was initially created as a clone of Bug #1814098 +++

Setting ovn-monitor-all=true in each node's ovsdb causes each ovn-controller to monitor all southbound database records instead of only those relevant to its own chassis. This reduces load on the southbound database at the expense of a bit more CPU and network activity on each node, which improves our ability to scale.

See OVN bug https://bugzilla.redhat.com/1808125 for more details.
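
For reference, a minimal sketch of setting the key by hand on a single node. It assumes the key name is ovn-monitor-all (as in the verification output in comment 4) and that ovs-vsctl can reach the node's local ovsdb. In a cluster the option is expected to be set by ovn-kubernetes itself, so this is only an illustration. The function name and the OVS_VSCTL override are hypothetical and exist only to allow a dry run.

```shell
#!/bin/sh
# Hypothetical helper, not part of ovn-kubernetes.
# OVS_VSCTL can be overridden (for example with "echo") for a dry run.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

enable_monitor_all() {
    # Store ovn-monitor-all=true in the external-ids column of the
    # single Open_vSwitch row in the node's local ovsdb.
    $OVS_VSCTL set Open_vSwitch . external-ids:ovn-monitor-all=true
}
```

Calling enable_monitor_all with OVS_VSCTL=echo prints the command instead of running it.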

Comment 4 Dan Williams 2020-03-23 18:57:08 UTC
Anurag, you should see

1) oc rsh into one of the ovn-controller containers and run 'ovs-vsctl get Open_vSwitch .  external-ids | grep monitor-all'; the output should include ovn-monitor-all=true
2) all ovn-node pods should start, leading to all nodes being Ready
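
Check 1 above could be scripted roughly as follows. This is a sketch only: has_monitor_all is a hypothetical helper, and it accepts the value either quoted or unquoted because the exact formatting printed by ovs-vsctl is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: reads an external-ids map on stdin and succeeds
# only if it contains ovn-monitor-all set to true (quoted or unquoted).
has_monitor_all() {
    grep -Eq '(^|[{ ,])ovn-monitor-all="?true"?([,} ]|$)'
}

# Usage on a cluster (the pod name is a placeholder):
#   oc exec -n openshift-ovn-kubernetes -c ovn-controller <ovnkube-node-pod> -- \
#       ovs-vsctl get Open_vSwitch . external-ids | has_monitor_all \
#       && echo "monitor-all enabled"
```

The helper exits non-zero when the key is absent, which matches the empty grep output a node without the setting would produce.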

Comment 5 Anurag saxena 2020-04-02 18:53:44 UTC
@dcbw, this doesn't seem to be present in the latest nightly or CI builds. Can you reference the PR here?

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-02-101459   True        False         33m     Cluster version is 4.5.0-0.nightly-2020-04-02-101459

$ oc rsh -c ovn-controller ovnkube-node-fcs6r
sh-4.2# ovs-vsctl get Open_vSwitch .  external-ids
{hostname="ip-10-0-174-13.ap-northeast-1.compute.internal", ovn-bridge-mappings="physnet:br-local", ovn-encap-ip="10.0.174.13", ovn-encap-type=geneve, ovn-nb="ssl:10.0.133.150:9641,ssl:10.0.156.112:9641,ssl:10.0.169.235:9641", ovn-openflow-probe-interval="180", ovn-remote="ssl:10.0.133.150:9642,ssl:10.0.156.112:9642,ssl:10.0.169.235:9642", ovn-remote-probe-interval="100000", rundir="/var/run/openvswitch", system-id="72e75ee7-269c-43a5-b64f-02652e46bc9d"}
sh-4.2# ovs-vsctl get Open_vSwitch .  external-ids | grep monitor-all
sh-4.2# exit
exit
command terminated with exit code 1


# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.ci-2020-04-02-142911   True        False         20m     Cluster version is 4.5.0-0.ci-2020-04-02-142911

# oc exec -c ovn-controller ovnkube-node-mm9st -n openshift-ovn-kubernetes -- ovs-vsctl get Open_vSwitch .  external-ids | grep monitor-all
#

Comment 6 Dan Williams 2020-04-02 21:08:40 UTC
Sorry this one got convoluted.

The original PR this bug was filed for was reverted. But we now have a *new* PR merged for release-4.5 that re-implements this in conjunction with OVN changes.

https://github.com/openshift/ovn-kubernetes/pull/126

So if you retest with tomorrow's image, the validation instructions should still be correct and you should see monitor-all in the ovs-vsctl output.

You just happened to test this bug during the revert window and before PR #126 landed. And we forgot to update the bug with that status. Sorry!

Comment 9 errata-xmlrpc 2020-08-04 18:05:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409