2022824 – Large number of sessions created by vmware-vsphere-csi-driver-operator during e2e tests

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Derek Pryor
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-11-12 17:16 UTC by rvanderp
Modified:	2022-03-10 16:27 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:26:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift kubernetes pull 1104	0	None	open	Bug 2022824: Fix the leak of vSphere client sessions	2022-01-10 20:10:24 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:27:10 UTC

Description rvanderp 2021-11-12 17:16:22 UTC

Description of problem:
During 4.10 vSphere e2e periodic runs, the number of vCenter sessions has been observed to grow rapidly.  Shortly after the growth begins, the CI user exhausts all available sessions (500/user).  This causes a cascade in which other CI jobs begin to fail.

After some investigation, session growth appears to be related to calls to checkForExistingPolicy[https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L284]

A new client connection is created here[https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L291] but does not appear to be getting closed.

Also seen in:
https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L236

A similar issue https://bugzilla.redhat.com/show_bug.cgi?id=2009859 was recently addressed at https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/bb7eaac1f3afc9d094aa10a19bcdc48ad84d18ca/pkg/operator/storageclasscontroller/storageclasscontroller.go#L168
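
To illustrate the suspected pattern (a login per policy check with no matching logout), here is a minimal, self-contained Go sketch. It does not use the real vSphere client: fakeVCenter, checkPolicyLeaky, checkPolicyFixed, and syncUntilFailure are hypothetical names invented for this example, and the 500-session limit is the per-user figure quoted above. The only point is that a deferred logout keeps the open-session count flat, while the leaky variant hits the limit.

```go
package main

import (
	"errors"
	"fmt"
)

// maxSessions mirrors the per-user limit quoted in this report (500/user).
const maxSessions = 500

// errUnavailable stands in for the 503 seen once sessions are exhausted.
var errUnavailable = errors.New("503 Service Unavailable")

// fakeVCenter is a hypothetical stand-in for vCenter; it only counts open sessions.
type fakeVCenter struct {
	open int
}

// login opens a session and fails once the limit has been reached.
func (v *fakeVCenter) login() error {
	if v.open >= maxSessions {
		return errUnavailable
	}
	v.open++
	return nil
}

// logout releases one session if any is open.
func (v *fakeVCenter) logout() {
	if v.open > 0 {
		v.open--
	}
}

// checkPolicyLeaky logs in and never logs out (the suspected pattern).
func checkPolicyLeaky(v *fakeVCenter) error {
	if err := v.login(); err != nil {
		return fmt.Errorf("error logging into vcenter: %w", err)
	}
	// ... the storage policy lookup would happen here ...
	return nil
}

// checkPolicyFixed logs in and always logs out before returning.
func checkPolicyFixed(v *fakeVCenter) error {
	if err := v.login(); err != nil {
		return fmt.Errorf("error logging into vcenter: %w", err)
	}
	defer v.logout()
	// ... the storage policy lookup would happen here ...
	return nil
}

// syncUntilFailure calls check up to n times and returns how many calls succeeded.
func syncUntilFailure(v *fakeVCenter, check func(*fakeVCenter) error, n int) int {
	for i := 0; i < n; i++ {
		if err := check(v); err != nil {
			return i
		}
	}
	return n
}

func main() {
	leaky := &fakeVCenter{}
	ok := syncUntilFailure(leaky, checkPolicyLeaky, 1000)
	fmt.Println("leaky: successful syncs:", ok, "open sessions:", leaky.open)

	fixed := &fakeVCenter{}
	ok = syncUntilFailure(fixed, checkPolicyFixed, 1000)
	fmt.Println("fixed: successful syncs:", ok, "open sessions:", fixed.open)
}
```

In real code the deferred call would be whatever logout or close method the client library in use provides; the sketch makes no claim about how the operator is actually structured.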

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-09-181140

How reproducible:
Consistently

Steps to Reproduce:
1. Perform a vSphere IPI install on 4.10.0-0.nightly-2021-11-09-181140
2. Run e2e tests with the suite openshift/conformance/parallel
3. Monitor session growth with govc or the vCenter console

Note: Typically the active session list will reveal the issue.  On occasion, I have seen vCenter classify these sessions as idle, in which case govc won't report them, but they still count against the user/cluster session limits.

Actual results:
Active/idle sessions grow and are only closed when the session times out after 30 minutes

Expected results:
Active/idle sessions should not grow during use

Master Log:

Node Log (of failed PODs):

PV Dump:
N/A

PVC Dump:
N/A

StorageClass Dump (if StorageClass used by PV/PVC):
N/A

Additional info:

Session growth appears to coincide with these log entries.  As shown, the operator is running into session exhaustion (also confirmed by reviewing the vCenter logs).

I1112 16:14:06.265138       1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
I1112 16:14:09.248399       1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
I1112 16:14:10.503282       1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
E1112 16:14:11.859266       1 vmware.go:82] error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable
E1112 16:14:11.859292       1 storageclasscontroller.go:121] error syncing storage policy: error connecting to vcenter API: error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable
E1112 16:14:11.921186       1 base_controller.go:272] VMwareVSphereDriverStorageClassController reconciliation failed: error connecting to vcenter API: error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable

Comment 1 Hemant Kumar 2021-12-06 21:16:07 UTC

We found the root cause of the session leak to be the in-tree e2e tests, which create vCenter connections on the fly, so the tests themselves need fixing. For now we are going to disable the in-tree vSphere tests: https://github.com/openshift/origin/pull/26670

Comment 3 David Eads 2021-12-07 15:04:28 UTC

This is a 4.10.0 blocker because, in order to get signal on vSphere in general, we had to disable nearly 100 vSphere storage tests (https://github.com/openshift/origin/pull/26670).

Comment 4 Derek Pryor 2022-01-05 21:37:31 UTC

A fix to the e2e tests has been created and is in the process of being upstreamed (https://github.com/kubernetes/kubernetes/pull/107337)

Comment 7 Wei Duan 2022-01-25 10:28:40 UTC

Verified with 4.10.0-0.nightly-2022-01-24-070025 on a vSphere 6.7 test environment where sessions can be checked.


openshift-tests run "openshift/conformance/parallel" --provider "${TEST_PROVIDER}" --dry-run | grep storage | openshift-tests run "openshift/conformance/parallel" --provider "${TEST_PROVIDER}" -f - 
...
error: 2 fail, 350 pass, 1438 skip (56m2s)


govc session.ls shows no session growth.


Updating status to "Verified".

Comment 10 errata-xmlrpc 2022-03-10 16:26:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
