Bug 2097153 - poor performance on API call to vCenter ListTags with thousands of tags
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: dmoiseev
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-15 01:19 UTC by Brian Ward
Modified: 2022-10-26 08:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Improved performance in vCenter clusters with thousands of tags and heavy API load. Machine controllers now query only the tags related to the particular OCP installation.
Clone Of:
Environment:
Last Closed: 2022-08-10 11:17:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 1027 0 None open Bug 2097153: change ListTags call to ListTagsForCategory 2022-06-15 16:07:42 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:18:10 UTC

Description Brian Ward 2022-06-15 01:19:01 UTC
Description of problem:

A vSphere IPI installation fails because the workers fail to provision. If we leave the install alone for about a day, the workers eventually show up and the cluster stabilizes.

This vCenter has 2400+ tags, and generally appears bogged down by the machine-api-controller's reconcile loop.  

Some ListTags calls return after 15 minutes, some after 75 minutes, and most are disconnected or time out entirely.  When vCenter has time to "catch up" in the early morning hours, we finally see machines get provisioned.  However, the Machine definition is frequently stuck at Provisioning or Provisioned.  CSRs fail to be approved automatically (that may be a separate issue, or it may just be a consequence of the significant delay in provisioning).

https://github.com/openshift/machine-api-operator/blob/release-4.10/pkg/controller/vsphere/reconciler.go#L1034

Version-Release number of selected component (if applicable):

We tested on 4.10 but this is probably reproducible on all existing vsphere IPI code bases.  

How reproducible:

Every install on this particular vCenter, which has 2400+ tags and is heavily used by other automation tools.  The logs show many Info statements for "Reconciling attached tags"

https://github.com/openshift/machine-api-operator/blob/release-4.10/pkg/controller/vsphere/reconciler.go#L1034

but only many hours later do we see a successful "Attaching XXX tag to vm"

https://github.com/openshift/machine-api-operator/blob/release-4.10/pkg/controller/vsphere/reconciler.go#L992


Steps to Reproduce:
1. obtain a heavily used vCenter with thousands of tags (fun trick) 
2. run openshift-install vSphere IPI


Actual results:

Failed install due to no workers.

Expected results:

Successful install

Additional info:

The masters come up quickly, likely because there is no call to ListTags on the install process (unless I've missed something).

In the code base, we are requesting all Tags in vCenter, which is not necessary since we are only concerned with our own specific tags:

https://github.com/openshift/installer/blob/master/data/data/vsphere/pre-bootstrap/main.tf#L57

I propose we switch to ListTagsForCategory and select the category we specify during the install stage.

Comment 1 Brian Ward 2022-06-15 01:22:21 UTC
I've tested a patched machine-api-operator and have very good results.  Our calls with GetTagsForCategory are down to 2 minutes, from upwards of 75 minutes on GetTags.

https://github.com/openshift/machine-api-operator/pull/1026

Comment 2 Brian Ward 2022-06-15 01:32:54 UTC
The fix, rebased onto master, is in this pull request:

https://github.com/openshift/machine-api-operator/pull/1027

Comment 4 Huali Liu 2022-07-07 03:00:19 UTC
It's hard to prepare this prerequisite (obtain a heavily used vCenter with thousands of tags), because the test team has only one vCenter, shared by everyone, and loading it this way would make it unavailable to others.
In addition, based on https://bugzilla.redhat.com/show_bug.cgi?id=2097153#c1, I think we can move this to Verified.

Comment 7 errata-xmlrpc 2022-08-10 11:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

