Description of problem:
During 4.10 vSphere e2e periodic runs, vCenter sessions have been observed to grow rapidly. Shortly after the growth begins, the CI user exhausts all available sessions (500/user). This causes a cascade in which other CI jobs begin to fail.

After some investigation, the session growth appears to be related to calls to checkForExistingPolicy [https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L284]. A new client connection is created here [https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L291] but does not appear to be closed. The same pattern is also seen at: https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/pkg/operator/storageclasscontroller/vmware.go#L236

A similar issue, https://bugzilla.redhat.com/show_bug.cgi?id=2009859, was recently addressed at https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/bb7eaac1f3afc9d094aa10a19bcdc48ad84d18ca/pkg/operator/storageclasscontroller/storageclasscontroller.go#L168

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-09-181140

How reproducible:
Consistently

Steps to Reproduce:
1. Perform a vSphere IPI install on 4.10.0-0.nightly-2021-11-09-181140
2. Run e2e tests with the suite openshift/conformance/parallel
3. Monitor session growth with govc or the vCenter console

Note: Typically, the active sessions will reveal the issue. On occasion, I have seen vCenter classify these sessions as idle, in which case govc won't report them, but they are still counted against the user/cluster session limits.
Actual results:
Active/idle sessions grow and are only closed when the session times out after 30 minutes.

Expected results:
Active/idle sessions should not grow during use.

Master Log:

Node Log (of failed PODs):

PV Dump:
N/A

PVC Dump:
N/A

StorageClass Dump (if StorageClass used by PV/PVC):
N/A

Additional info:
Session growth appears to coincide with the log entries below. As they show, the operator is running into session exhaustion (also confirmed by reviewing the vCenter logs).

I1112 16:14:06.265138 1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
I1112 16:14:09.248399 1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
I1112 16:14:10.503282 1 vmware.go:314] Found existing profile with same name: openshift-storage-policy-rvanderp-dev-kxn7t
E1112 16:14:11.859266 1 vmware.go:82] error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable
E1112 16:14:11.859292 1 storageclasscontroller.go:121] error syncing storage policy: error connecting to vcenter API: error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable
E1112 16:14:11.921186 1 base_controller.go:272] VMwareVSphereDriverStorageClassController reconciliation failed: error connecting to vcenter API: error logging into vcenter: POST https://ibmvcenter.vmc-ci.devcluster.openshift.com/rest/com/vmware/cis/session: 503 Service Unavailable
We found the root cause of the session leakage to be the in-tree e2e tests, which create vCenter connections on the fly, so the tests themselves need fixing. For now we are going to disable the in-tree vSphere tests: https://github.com/openshift/origin/pull/26670
This is a 4.10.0 blocker because, in order to get signal on vSphere in general, we had to disable nearly 100 vSphere storage tests (https://github.com/openshift/origin/pull/26670).
A fix for the e2e tests has been created and is in the process of being upstreamed (https://github.com/kubernetes/kubernetes/pull/107337).
Verified with 4.10.0-0.nightly-2022-01-24-070025 on a vSphere 6.7 test environment where the sessions could be checked.

openshift-tests run "openshift/conformance/parallel" --provider "${TEST_PROVIDER}" --dry-run | grep storage | openshift-tests run "openshift/conformance/parallel" --provider "${TEST_PROVIDER}" -f -
...
error: 2 fail, 350 pass, 1438 skip (56m2s)

No session growth was seen with govc session.ls. Updating status to "Verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056