Bug 2047844 - Leaking sessions to vCenter causing vpxd to crash
Summary: Leaking sessions to vCenter causing vpxd to crash
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.z
Assignee: dmoiseev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 2004953 2048496
Blocks:
Reported: 2022-01-28 17:17 UTC by Matthew Robson
Modified: 2022-02-24 15:26 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-24 15:26:58 UTC
Target Upstream Version:
Embargoed:



Description Matthew Robson 2022-01-28 17:17:58 UTC
Description of problem:

The vSphere team reported problems with their clusters due to high load and vpxd crashing. While debugging with VMware, they could see thousands of sessions from their 'osedeploy' account across their 40+ IPI clusters.
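For readers unfamiliar with the failure mode, here is a minimal sketch of how sessions can pile up on a server when a client logs in repeatedly without logging out. Everything in it is hypothetical (FakeVCenter, leaky_check, clean_check are illustrative names, not vSphere or OpenShift APIs), and it makes no claim about which component created the sessions in this report.

```python
from contextlib import contextmanager


class FakeVCenter:
    """Hypothetical in-memory stand-in for a server that tracks open sessions."""

    def __init__(self):
        self.sessions = set()
        self._next_id = 0

    def login(self, user):
        # Each login creates a new server-side session.
        self._next_id += 1
        session_id = f"{user}-{self._next_id}"
        self.sessions.add(session_id)
        return session_id

    def logout(self, session_id):
        # Releasing the session removes it from the server's table.
        self.sessions.discard(session_id)


def leaky_check(server, user):
    """Logs in for a periodic check but never logs out: one leaked session per call."""
    server.login(user)
    return True


@contextmanager
def session(server, user):
    """Pairs every login with a logout, even if the body raises."""
    session_id = server.login(user)
    try:
        yield session_id
    finally:
        server.logout(session_id)


def clean_check(server, user):
    """Same check, but the session is released when the check finishes."""
    with session(server, user):
        return True


if __name__ == "__main__":
    leaky, clean = FakeVCenter(), FakeVCenter()
    for _ in range(1000):
        leaky_check(leaky, "osedeploy")
        clean_check(clean, "osedeploy")
    print(len(leaky.sessions), len(clean.sessions))
```

With many clusters running periodic checks against one vCenter, the first pattern grows without bound until sessions time out on the server side, while the second keeps the count flat.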

We started to see 503 errors from pods such as the vSphere problem detector and the cluster storage operator.

./var/log/pods/openshift-cluster-storage-operator_cluster-storage-operator-8558ccf8dd-rlsd9_bc544c60-4f94-438f-aa81-daddb8d9b691/cluster-storage-operator/0.log:2022-01-22T01:48:20.811341202+00:00 stderr F I0122 01:48:20.811270       1 status_controller.go:211] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-21T16:55:43Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-01-21T17:48:04Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-21T17:48:18Z","message":"VSphereProblemDetectorControllerAvailable: failed to connect to server.company.com: POST \"/sdk\": 503 Service Unavailable","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-21T16:55:56Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
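When triaging many clusters, it can help to pull the failing condition out of log lines like the one above. The following is a rough sketch, assuming the line contains " diff " followed by a single JSON object as shown; the function name and the shortened SAMPLE line are illustrative only.

```python
import json

# Shortened copy of the log line above (two of the four conditions kept).
SAMPLE = r'I0122 01:48:20.811270       1 status_controller.go:211] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-21T16:55:43Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-01-21T17:48:18Z","message":"VSphereProblemDetectorControllerAvailable: failed to connect to server.company.com: POST \"/sdk\": 503 Service Unavailable","reason":"AsExpected","status":"True","type":"Available"}]}}'


def conditions_mentioning(log_line, needle="503"):
    """Return (type, message) pairs for conditions whose message contains needle.

    Assumes the JSON payload starts right after the " diff " marker.
    Lines without the marker yield an empty list.
    """
    marker = " diff "
    idx = log_line.find(marker)
    if idx == -1:
        return []
    payload = json.loads(log_line[idx + len(marker):])
    conditions = payload.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("message"))
        for c in conditions
        if needle in c.get("message", "")
    ]


if __name__ == "__main__":
    for cond_type, message in conditions_mentioning(SAMPLE):
        print(f"{cond_type}: {message}")
```

Note that in the line above the 503 message is carried by the Available condition, whose status is still "True", so filtering on status alone would miss it.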


Version-Release number of selected component (if applicable):

4.8.27

How reproducible:

Always

Steps to Reproduce:
1. Let the clusters run

Actual results:

Many sessions accumulate on vCenter, causing performance issues.


Expected results:

