Bug 2044680
| Summary: | Additional libovsdb performance and resource consumption fixes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Williams <dcbw> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | cdc |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-03-10 16:42:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Dan Williams
2022-01-25 01:28:11 UTC
To verify the fix for https://github.com/ovn-org/libovsdb/pull/280:

1) Watch the ovnkube-master logs for the "endpoint lost leader, reconnecting" message, which indicates a RAFT leader change.
2) Ensure that ovnkube-master does *not* attempt to reconnect to the same IP address, but tries at least one other database's IP first.

I0127 18:05:51.640192 1 client.go:1032] "msg"="endpoint lost leader, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.3:9642" "sid"="6c31b525-caaf-4e35-96de-342398010952"
^^^ leader changed; ovnkube-master should *not* reconnect immediately to 10.0.0.3

I0127 18:05:51.641633 1 client.go:1108] "msg"="connection lost, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642"
^^^ expected result; it will try connecting to a different DB than the one that just lost leadership

E0127 18:05:52.164019 1 client.go:1099] "msg"="failed to reconnect" "error"="unable to connect to any endpoints: failed to connect to ssl:10.0.0.4:9642: endpoint is not leader. failed to connect to ssl:10.0.0.5:9642: endpoint is not leader. failed to connect to ssl:10.0.0.3:9642: endpoint is not leader" "database"="OVN_Southbound"

I0127 18:05:52.234130 1 client.go:254] "msg"="successfully connected" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642" "sid"="b3956ec0-e914-4b6e-8fee-30bb49ff59b4"
^^^ eventually reconnects to the new leader; should not be the one that just lost leadership (10.0.0.3)

I0127 18:05:52.234284 1 client.go:276] "msg"="reconnected - restarting monitors" "database"="OVN_Southbound"

For https://github.com/ovn-org/libovsdb/pull/285 we should verify that the memory usage of ovnkube-master is about the same as or lower than before the change, over the same benchmark.
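The log check for pull/280 could be automated roughly as follows. This is a minimal sketch, not part of libovsdb or ovn-kubernetes: the helper name `firstReconnectDiffers` and the regular expression are assumptions based only on the log format quoted in this description. It finds the endpoint named in the first "endpoint lost leader, reconnecting" line and checks that the next "connection lost, reconnecting" line targets a different endpoint.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// endpointRe extracts the value of the "endpoint" key from a log line.
// Assumption: the key/value layout matches the log lines quoted in this bug.
var endpointRe = regexp.MustCompile(`"endpoint"="([^"]+)"`)

// firstReconnectDiffers is a hypothetical helper (not a libovsdb API).
// It scans log lines in order and returns:
//   - lost: the endpoint from the first "endpoint lost leader" message
//   - next: the endpoint from the first following "connection lost" message
//   - ok:   true only if both were found and they differ
func firstReconnectDiffers(lines []string) (lost, next string, ok bool) {
	for _, line := range lines {
		m := endpointRe.FindStringSubmatch(line)
		if m == nil {
			continue // line carries no "endpoint" key, e.g. the error summary
		}
		switch {
		case lost == "" && strings.Contains(line, `"msg"="endpoint lost leader, reconnecting"`):
			lost = m[1]
		case lost != "" && strings.Contains(line, `"msg"="connection lost, reconnecting"`):
			return lost, m[1], m[1] != lost
		}
	}
	return lost, "", false
}

func main() {
	// Two of the log lines quoted in the description above.
	logs := []string{
		`I0127 18:05:51.640192 1 client.go:1032] "msg"="endpoint lost leader, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.3:9642" "sid"="6c31b525-caaf-4e35-96de-342398010952"`,
		`I0127 18:05:51.641633 1 client.go:1108] "msg"="connection lost, reconnecting" "database"="OVN_Southbound" "endpoint"="ssl:10.0.0.4:9642"`,
	}
	lost, next, ok := firstReconnectDiffers(logs)
	fmt.Printf("lost=%s next=%s ok=%t\n", lost, next, ok)
}
```

The sketch only inspects the first reconnect attempt after a leader change; it does not check which endpoint the client finally settles on.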
@Anurag, I and Ross ran perf tests on two different builds:
4.10.0-0.nightly-2022-01-31-012936 and 4.10.0-0.nightly-2022-01-29-215708

Based on the memory usage of ovnkube-master in each cluster, I can confirm that this patch does not seem to have any negative perf impact.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056