
Bug 2047844

Summary: Leaking sessions to vCenter causing vpxd to crash
Product: OpenShift Container Platform Reporter: Matthew Robson <mrobson>
Component: Cloud Compute Assignee: dmoiseev
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED NOTABUG Docs Contact:
Severity: urgent    
Priority: high CC: aos-bugs, hekumar, jspeed
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-24 15:26:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2004953, 2048496    
Bug Blocks:    

Description Matthew Robson 2022-01-28 17:17:58 UTC
Description of problem:

The vSphere team reported problems with their clusters due to high load and vpxd crashing. While debugging with VMware, they could see thousands of sessions from their 'osedeploy' account across their 40+ IPI clusters.

We started to see 503 errors from pods such as the vSphere problem detector and the cluster storage operator.

./var/log/pods/openshift-cluster-storage-operator_cluster-storage-operator-8558ccf8dd-rlsd9_bc544c60-4f94-438f-aa81-daddb8d9b691/cluster-storage-operator/0.log:2022-01-22T01:48:20.811341202+00:00 stderr F I0122 01:48:20.811270       1 status_controller.go:211] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-21T16:55:43Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-01-21T17:48:04Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-21T17:48:18Z","message":"VSphereProblemDetectorControllerAvailable: failed to connect to server.company.com: POST \"/sdk\": 503 Service Unavailable","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-21T16:55:56Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
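
For anyone triaging similar logs, the following is a minimal sketch of how the 503 condition could be pulled out of a status_controller "diff" line like the one above. The function name and the shortened sample line are illustrative assumptions, not part of any OpenShift tooling; it only assumes the JSON payload follows the literal text " diff " as it does in the log above.

```python
import json


def conditions_with_503(log_line):
    """Return the status conditions whose message mentions a 503.

    Hypothetical helper: assumes the JSON payload follows the first
    occurrence of " diff " in the log line, as in the excerpt above.
    """
    marker = " diff "
    idx = log_line.find(marker)
    if idx == -1:
        return []
    try:
        payload = json.loads(log_line[idx + len(marker):])
    except json.JSONDecodeError:
        return []
    conditions = payload.get("status", {}).get("conditions", [])
    return [c for c in conditions if "503" in c.get("message", "")]


# Shortened sample based on the log line above (most conditions omitted).
sample = (
    r'I0122 01:48:20.811270 1 status_controller.go:211] '
    r'clusteroperator/storage diff {"status":{"conditions":['
    r'{"message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},'
    r'{"message":"VSphereProblemDetectorControllerAvailable: failed to connect to '
    r'server.company.com: POST \"/sdk\": 503 Service Unavailable",'
    r'"reason":"AsExpected","status":"True","type":"Available"}]}}'
)

for cond in conditions_with_503(sample):
    print(cond["type"], "-", cond["message"])
```

Run across the pod logs of each cluster, a filter like this could help narrow down when the 503 responses from vCenter began.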


Version-Release number of selected component (if applicable):

4.8.27

How reproducible:

Always

Steps to Reproduce:
1. Let the clusters run.

Actual results:

Many sessions are opened against vCenter, causing performance issues.


Expected results: