Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1797654

Summary: [OVN] OVN Metadata agent should not monitor the whole Chassis table on the SB database, just its own row
Product: Red Hat OpenStack Reporter: Daniel Alvarez Sanchez <dalvarez>
Component: python-networking-ovn    Assignee: Terry Wilson <twilson>
Status: CLOSED CURRENTRELEASE QA Contact: Eran Kuris <ekuris>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.0 (Train)    CC: apevec, ekuris, jlibosva, lhh, majopela, njohnston, scohen, twilson
Target Milestone: z3    Keywords: TestOnly, Triaged
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-networking-ovn-7.1.1-0.20200331050215.b96fa44.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-16 10:46:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1797685    

Description Daniel Alvarez Sanchez 2020-02-03 15:01:57 UTC
The OVN metadata agent is deployed on every compute node, so in large deployments the SB ovsdb-server has to notify all of those connections for every table/event, creating a lot of stress on the server. In particular, the current health-check mechanism is hammering the server:

1. nb_cfg is bumped by neutron-server to check the state of the agents.
2. Each agent writes this nb_cfg value back to the external_ids column of its Chassis row: N write transactions (one per agent).
3. Each of those writes generates 2*N notifications, since each compute node has at least one ovn-controller and one OVN metadata agent connected to the SB DB: 2*N*N notifications in total.
4. For each of those bumps, neutron-server records the timestamp on the same row (N more writes).
5. Those N writes from neutron-server each generate 2*N notifications again: another 2*N*N.

To sum up, every health check triggers N writes + 2*N^2 notifications, plus N more writes + 2*N^2 more notifications. At scale (100+ compute nodes) we've seen ovsdb-server pinned at 100% CPU, failing to respond to pacemaker health checks and causing disconnections and failovers that take down the control plane.
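As a rough illustration of the arithmetic above (a sketch only; N is the number of compute nodes, and the factor of 2 assumes exactly one ovn-controller plus one metadata agent connected per node):

```python
def health_check_load(n):
    """Approximate per-health-check load on the SB ovsdb-server.

    n: number of compute nodes. Assumes 2*n clients monitoring the
    Chassis table (one ovn-controller and one metadata agent per node).
    """
    agent_writes = n           # step 2: each agent bumps nb_cfg in its own row
    server_writes = n          # step 4: neutron-server timestamps each row
    # steps 3 and 5: every write is broadcast to all 2*n monitoring clients
    notifications = 2 * n * (agent_writes + server_writes)
    return agent_writes + server_writes, notifications

writes, notifications = health_check_load(100)
print(writes, notifications)   # 200 writes, 40000 notifications per health check
```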


By making the OVN metadata agent subscribe only to its own Chassis row, we would save half of the notifications, as ovn-controller is still going to be notified.

The BZ to address the remaining half is on core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=1797520
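A minimal sketch of the idea (not the actual networking-ovn patch): OVSDB conditional monitoring lets a client restrict a monitor to rows matching a condition, where each clause has the form [column, function, value]. The helper name below is hypothetical; "compute-0" is a placeholder chassis name.

```python
# Sketch: build an OVSDB monitor condition matching only this agent's
# Chassis row, using the [column, function, value] clause format from
# ovsdb-server's conditional monitoring extension to RFC 7047.
def own_chassis_condition(chassis_name):
    # A condition is a list of clauses; a single equality clause on the
    # "name" column selects exactly one Chassis row.
    return [["name", "==", chassis_name]]

cond = own_chassis_condition("compute-0")
print(cond)  # [['name', '==', 'compute-0']]
```

With the python-ovs IDL, a condition like this would typically be applied via Idl.cond_change("Chassis", cond), after which the SB server only sends updates for that one row instead of the whole table.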

Comment 6 Lon Hohberger 2020-06-18 10:46:50 UTC
According to our records, this should be resolved by python-networking-ovn-7.1.1-0.20200403214619.4114bc5.el8ost.  This build is available now.

Comment 7 Eran Kuris 2020-07-16 06:47:00 UTC
The bug can be verified by this automation test: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16-dsvm-functional-rhos/lastCompletedBuild/testReport/networking_ovn.tests.functional.test_metadata_agent/TestMetadataAgent/test_metadata_agent_only_monitors_own_chassis/

fix verified
()[neutron@controller-0 /]$ rpm -qa | grep ovn
puppet-ovn-15.4.1-0.20200229002436.192ac4e.el8ost.noarch
python3-networking-ovn-7.1.1-0.20200403214619.4114bc5.el8ost.noarch
[stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS_TRUNK-16.0-RHEL-8-20200706.n.0