
Bug 2047844

Summary:	Leaking sessions to vCenter causing vpxd to crash
Product:	OpenShift Container Platform	Reporter:	Matthew Robson <mrobson>
Component:	Cloud Compute	Assignee:	dmoiseev
Cloud Compute sub component:	Other Providers	QA Contact:	sunzhaohua <zhsun>
Status:	CLOSED NOTABUG	Docs Contact:
Severity:	urgent
Priority:	high	CC:	aos-bugs, hekumar, jspeed
Version:	4.8
Target Milestone:	---
Target Release:	4.8.z
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-02-24 15:26:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2004953, 2048496
Bug Blocks:

Description Matthew Robson 2022-01-28 17:17:58 UTC

Description of problem:

The vSphere team reported problems with their clusters due to high load and vpxd crashing. While debugging with VMware, they could see thousands of sessions from their 'osedeploy' account across their 40+ IPI clusters.

We started to see 503 errors from pods such as the vSphere problem detector and the cluster storage operator.

./var/log/pods/openshift-cluster-storage-operator_cluster-storage-operator-8558ccf8dd-rlsd9_bc544c60-4f94-438f-aa81-daddb8d9b691/cluster-storage-operator/0.log:2022-01-22T01:48:20.811341202+00:00 stderr F I0122 01:48:20.811270       1 status_controller.go:211] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-21T16:55:43Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-01-21T17:48:04Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-21T17:48:18Z","message":"VSphereProblemDetectorControllerAvailable: failed to connect to server.company.com: POST \"/sdk\": 503 Service Unavailable","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-21T16:55:56Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
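
To gauge how widespread the 503s are across a collected log dump, a small script along these lines could tally matching lines per pod. This is only a sketch, not part of any OpenShift tooling: the function name count_503_by_pod is made up for illustration, and it assumes grep-style input where each line starts with the pod log path, as in the excerpt above.

```python
import re
from collections import Counter

# Hypothetical helper: counts log lines reporting a 503 from vCenter,
# grouped by the pod directory name found in the log path.
# Assumes each line begins with a path like ./var/log/pods/<pod-dir>/...
POD_DIR = re.compile(r"/var/log/pods/([^/]+)/")
MARKER = "503 Service Unavailable"


def count_503_by_pod(lines):
    """Return a Counter mapping pod directory name to number of 503 lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = POD_DIR.search(line)
        # Lines without a recognizable pod path are grouped under "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

Any iterable of strings works as input, for example an open file containing the saved grep output.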


Version-Release number of selected component (if applicable):

4.8.27

How reproducible:

Always

Steps to Reproduce:
1. Let the clusters run

Actual results:

Many sessions accumulate on vCenter, causing performance issues.


Expected results: