Description
Daniel Alvarez Sanchez, 2020-02-03 15:01:57 UTC
The OVN metadata agent is deployed on every compute node, so in large deployments the SB ovsdb-server has to notify all of those connections for each table/event, creating a lot of stress on the server. In particular, the current health check mechanism is hammering the server:
1. nb_cfg is bumped by neutron-server to check the state of the agents.
2. Each agent writes this nb_cfg value back to its Chassis row, in the external_ids column: N write transactions (one per agent).
3. Each of those writes generates 2*N notifications (each compute node has at least one ovn-controller and one OVN metadata agent connected to the SB DB): 2*N*N in total.
4. For each of those bumps, neutron-server records the timestamp on the same row (N more writes).
5. So N writes from neutron-server, each again generating 2*N notifications: another 2*N*N.
To sum up, every health check triggers: N writes + 2*N^2 notifications + N writes + 2*N^2 notifications. At scale (100+ compute nodes) we've seen ovsdb-server at 100% CPU, failing to respond to pacemaker health checks, which causes disconnections and failovers that take down the control plane.
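The tally above can be sketched with a quick model (N is the number of compute nodes; the function name is illustrative, not code from the agent):

```python
def healthcheck_load(n):
    """Model the SB ovsdb-server load of one health check for n compute nodes.

    Each Chassis row write fans out to 2*n monitors (one ovn-controller and
    one metadata agent per node), and both the agents' nb_cfg write-back and
    neutron-server's timestamp write touch n rows each.
    """
    agent_writes = n                    # each agent writes nb_cfg back
    agent_notifications = 2 * n * n
    server_writes = n                   # neutron-server records the timestamp
    server_notifications = 2 * n * n
    return (agent_writes + server_writes,
            agent_notifications + server_notifications)

# At 100 compute nodes, a single health check already costs 200 writes and
# 40,000 notifications, consistent with the 100%-CPU behaviour seen above.
print(healthcheck_load(100))  # (200, 40000)
```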
By making each OVN metadata agent subscribe only to its own Chassis row, we would save half of the notifications, as ovn-controller is still going to be notified.
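A toy model of that subscription change (the real fix would use OVSDB conditional monitoring; this only counts the fan-out, and the names are made up for illustration):

```python
def notifications_per_phase(n, metadata_filtered):
    """Count SB notifications for one phase of the health check: n writes,
    one per Chassis row.

    Every write always reaches all n ovn-controllers (they monitor every
    chassis). Metadata agents either monitor every chassis too (current
    behaviour) or, with the proposed change, only their own row.
    """
    per_write_controllers = n
    per_write_metadata = 1 if metadata_filtered else n
    return n * (per_write_controllers + per_write_metadata)

before = notifications_per_phase(100, metadata_filtered=False)  # 20000
after = notifications_per_phase(100, metadata_filtered=True)    # 10100
print(before, after)
```

The metadata-agent side drops from N^2 to N notifications per phase, i.e. roughly half of the total, with the ovn-controller half left for the core OVN change below.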
The BZ to address the remaining half is against core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=1797520