2044680 – Additional libovsdb performance and resource consumption fixes

Summary: Additional libovsdb performance and resource consumption fixes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Casey Callendrello
QA Contact:	Anurag Saxena
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-25 01:28 UTC by Dan Williams
Modified:	2022-03-10 16:42 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:42:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovn-kubernetes pull 927	0	None	Merged	Bug 2044680: libovsdb performance and resource consumption fixes	2022-07-11 21:10:22 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:42:34 UTC

Description Dan Williams 2022-01-25 01:28:11 UTC

Backports of:

https://github.com/ovn-org/libovsdb/pull/286
https://github.com/ovn-org/libovsdb/pull/285
https://github.com/ovn-org/libovsdb/pull/280

Comment 5 Dan Williams 2022-01-28 21:14:02 UTC

To verify the fix for https://github.com/ovn-org/libovsdb/pull/280

1) Watch the ovnkube-master logs for the "endpoint lost leader, reconnecting" message, which indicates a RAFT leader change.
2) Ensure that ovnkube-master does *not* attempt to reconnect to the same IP address right away, but tries at least one other database's IP first.

I0127 18:05:51.640192       1 client.go:1032]  "msg"="endpoint lost leader, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.3:9642" "sid"="6c31b525-caaf-4e35-96de-342398010952"
^^^ leader changed, ovnkube-master should *not* reconnect immediately to 10.0.0.3

I0127 18:05:51.641633       1 client.go:1108]  "msg"="connection lost, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642"
^^^ expected result; it will try connecting to a different DB than the one that just lost leadership

E0127 18:05:52.164019       1 client.go:1099]  "msg"="failed to reconnect" "error"="unable to connect to any endpoints: failed to connect to ssl:10.0.0.4:9642: endpoint is not leader. failed to connect to ssl:10.0.0.5:9642: endpoint is not leader. failed to connect to ssl:10.0.0.3:9642: endpoint is not leader" "database"="OVN_Southbound" 

I0127 18:05:52.234130       1 client.go:254]  "msg"="successfully connected" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642" "sid"="b3956ec0-e914-4b6e-8fee-30bb49ff59b4"
^^^ eventually reconnects to the new leader; should not be the one that just lost leadership (10.0.0.3)

I0127 18:05:52.234284       1 client.go:276]  "msg"="reconnected - restarting monitors" "database"="OVN_Southbound" 
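
The manual check above could be automated with a short script. This is only a sketch: the function name `check_reconnects` and the parsing approach are assumptions made for illustration and are not part of libovsdb or ovn-kubernetes. It assumes the log lines keep the `"key"="value"` field format shown above, and it flags any case where the first endpoint logged after an "endpoint lost leader, reconnecting" message (for the same database) is the endpoint that just lost leadership.

```python
import re

# Matches the "key"="value" pairs used in the ovnkube-master lines above.
FIELD_RE = re.compile(r'"(\w+)"="([^"]*)"')
LOST_LEADER_MSG = "endpoint lost leader, reconnecting"


def check_reconnects(lines):
    """Return (database, endpoint) pairs where the first endpoint logged
    after a leader loss is the endpoint that just lost leadership."""
    pending = {}  # database -> endpoint that just lost leadership
    violations = []
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        db = fields.get("database")
        endpoint = fields.get("endpoint")
        if db is None or endpoint is None:
            continue  # e.g. the "failed to reconnect" line has no endpoint field
        if fields.get("msg") == LOST_LEADER_MSG:
            pending[db] = endpoint
        elif db in pending:
            lost = pending.pop(db)
            if endpoint == lost:
                violations.append((db, lost))
    return violations


# Two lines taken from the log excerpt above (sid field omitted).
SAMPLE = [
    'I0127 18:05:51.640192       1 client.go:1032]  "msg"="endpoint lost leader, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.3:9642"',
    'I0127 18:05:51.641633       1 client.go:1108]  "msg"="connection lost, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642"',
]

if __name__ == "__main__":
    print(check_reconnects(SAMPLE))
```

An empty result means no leader loss was followed by an immediate attempt against the same endpoint. The script only inspects the first endpoint logged after each leader change, so the final "successfully connected" endpoint would still need a separate check.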


For https://github.com/ovn-org/libovsdb/pull/285 we should verify that memory usage of ovnkube-master is about the same or lower than before the change over the same benchmark.
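
One way to make "about the same or lower" concrete is to compare the median memory samples from the two benchmark runs against a tolerance. The sketch below is an assumption throughout: the helper name `memory_not_worse`, the use of the median, and the 5% default tolerance do not come from this bug or from the libovsdb pull request, and should be adjusted to whatever the benchmark actually records.

```python
from statistics import median


def memory_not_worse(before_mib, after_mib, tolerance=0.05):
    """Return True if the median of after_mib is at most (1 + tolerance)
    times the median of before_mib.

    before_mib / after_mib: memory samples (for example in MiB) for
    ovnkube-master, collected over the same benchmark before and after
    the change. The 5% default tolerance is an illustrative assumption.
    """
    if not before_mib or not after_mib:
        raise ValueError("both runs need at least one sample")
    return median(after_mib) <= median(before_mib) * (1 + tolerance)
```

The median is used here so that a single short spike in either run does not decide the result. A stricter check could compare peak usage instead.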

Comment 7 Kedar Kulkarni 2022-02-01 20:33:19 UTC

Comment 8 Kedar Kulkarni 2022-02-01 20:35:24 UTC

@Anurag, Ross and I ran perf tests on two different builds:

4.10.0-0.nightly-2022-01-31-012936 and 
4.10.0-0.nightly-2022-01-29-215708

Based on the memory usage of ovnkube-master in each cluster, I can confirm that this patch does not seem to have any negative perf impact.

Comment 12 errata-xmlrpc 2022-03-10 16:42:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
